Lip Reading as a Service (Read Their Lips by Symphonic Labs) (readtheirlips.com)
49 points by draugadrotten 11 months ago | 12 comments


Thinking through some potentially interesting sources of videos where two people are talking but we don't know what was said, I think this is a decent starting point: https://www.youtube.com/watch?v=KLcfpU2cubo

Sadly, it doesn't work too well in this situation:

> That they didnt go through but i would tell you theyre just a chill look at here lets do it chills with all of our great men and they look at every chance they go oh do you want to the black man well thats my gosh thats my gosh thats my gosh thats my gosh thats my gosh thats my gosh thats


So far, it's no HAL 9000.

Uploaded video dialog:

Bowman: You know, of course, though, he's right about the 9000 series having a perfect operational record. They do.

Poole: Unfortunately that sounds a little like famous last words.

Bowman: Yeah, still it was his idea to carry out the failure mode analysis, wasn't it?

Poole: mmm

Bowman: Should certainly indicate... (away from camera): his integrity and self-confidence

Bowman: If he were wrong, it'd be the surest way of proving it.

Poole: It would be if he knew he was wrong.

Results:

"Of course there is recommended getting necessary to have a perfect operational rank i know youre going to be the first to do that youre going to get the best youre going to get the best youre going to get the best youre going to get the best youre going to get the best of yours if you want to rock better sure its well perfect."


I think you just proposed the ideal benchmark material for all future AI lip readings.


Has anyone tried this with some video where they know what the person is saying?

I'd be interested to know how accurate it is, and from what angles it can read lips (front-facing, side, etc.).

Sounds promising if it works well. Imagine all the historical videos without sound where you could finally find out what was being said.


Experienced lip readers are lucky to get half of what is said. Better than nothing, but not reliable enough for much, so it's better to use something else if possible.

The classic example is that 'I love you' and 'island view' have the same lip movements.


absolutely right.

My mother was a mostly-deaf lip-reader. She needed conversational context in order to keep up 'legibly', and it created a lot of fun between the two of us when, once in a while, her guesses failed spectacularly and she would come up with an oddball question or comment that had nothing to do with the conversation.

With context, though, it's a great tool. She and I used to watch crime dramas with the sound off late at night and never miss a beat. It feels like if you're trying to transcribe something that has a lot of structural context, the success rate is higher than 50%, but I don't know that formally.

It's still a tool I use in conversation. Even with good hearing it's tough to hear people in crowded restaurants or concert venues; lip-reading helps immensely.


What made me think of this was a documentary I saw years ago called "Hitler's Private World", where they used a lip reader and some video enhancements (I don't exactly remember) to read his lips in footage that didn't contain audio, such as film taken at his private villa. I don't know if what they found was ever vetted for accuracy, but it was quite amazing to be able to use lip reading to try to determine what was being discussed privately.

I found a copy of it on Dailymotion [1] and a brief description [2]. It's well worth a watch! I always wondered why they didn't use these techniques on other video recordings that have some mystery surrounding them.

[1] https://www.dailymotion.com/video/xlvimo

[2] https://m.imdb.com/title/tt1193023/


Great way to build labeled training data.

User-submitted videos (with audio for STT), user-crafted bounding boxes (we might not need these soon), and user-guided RLHF.

The submitted videos are likely diverse, challenging (otherwise the human might just do it), and representative of solving actual customer problems.


Doesn't even need to be user guided. Use videos that have audio. You could have one AI that generates a transcript using the audio/video and another that watches the video on mute and tries to read the lips. Feedback would then be provided by the AI that had access to the audio.
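The teacher/student loop described above could be sketched roughly as follows. This is a minimal illustration, not the service's actual pipeline: the ASR "teacher" and lip-reading "student" are stubbed out as hypothetical functions, and only the feedback signal (word error rate between the two transcripts) is implemented for real.

```python
# Self-supervised idea from the comment above: an ASR model transcribes
# the audio track, a lip-reading model transcribes the muted video, and
# the ASR transcript serves as the training label. Both models are stubs
# here; word_error_rate is the real feedback computation.

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def asr_teacher(clip):
    # Stub: in practice, run speech-to-text on the clip's audio track.
    return clip["audio_transcript"]

def lip_student(clip):
    # Stub: in practice, run the lip-reading model on the muted video.
    return clip["lip_guess"]

def training_signal(clips):
    """Per-clip loss proxy: how far the lip reader is from the ASR label."""
    return [word_error_rate(asr_teacher(c), lip_student(c)) for c in clips]
```

The nice property is that any video with an audio track becomes free training data, with no human labeling in the loop.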


I am thinking of the millions of hours of TV news. Presenters are almost always in the same position in frame, and there may already be high-quality transcripts.


Wondering how well this will perform on the viral video “Benny Lava” and if it will be part of a group of videos used to create a synthetic benchmark.


Has anyone tried this out on Radiohead's "Just" video yet?



