I made something similar, but just as a small CLI script that I run locally for myself. You can give it a URL for a YouTube video, a podcast link, or a local audio/video file. It transcribes the audio with Whisper and writes the full transcript to one text file, and I use another model to summarize it into a bullet list in a separate file.
I really appreciate these open source/open access models letting us build these kinds of tools without having to pay OpenAI or send them our data.
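For anyone curious about the shape of a script like that, here is a minimal sketch, not the actual script. It assumes the `openai-whisper` Python package (and ffmpeg) is installed, and it only handles local files; fetching a YouTube or podcast URL first is left out. The names `whisper_transcribe`, `naive_summarize`, and `process`, and the output file names, are made up for illustration. The summarizer is a placeholder that just takes the first few sentences; in practice you would swap in a call to whatever local model you use.

```python
from pathlib import Path
from typing import Callable, Tuple


def whisper_transcribe(media_path: str, model_name: str = "base") -> str:
    """Transcribe a local audio/video file with the openai-whisper package."""
    # Imported lazily so the rest of the script works without whisper installed.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(media_path)
    return result["text"].strip()


def naive_summarize(transcript: str, max_bullets: int = 5) -> str:
    """Placeholder summarizer (an assumption): first few sentences as bullets."""
    flat = transcript.replace("\n", " ")
    sentences = [s.strip() for s in flat.split(". ") if s.strip()]
    return "\n".join(f"- {s.rstrip('.')}" for s in sentences[:max_bullets])


def process(
    media_path: str,
    out_dir: str,
    transcribe: Callable[[str], str] = whisper_transcribe,
    summarize: Callable[[str], str] = naive_summarize,
) -> Tuple[Path, Path]:
    """Write the full transcript and a bullet summary to two separate files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(media_path).stem

    transcript = transcribe(media_path)
    transcript_file = out / f"{stem}.transcript.txt"
    summary_file = out / f"{stem}.summary.txt"

    transcript_file.write_text(transcript, encoding="utf-8")
    summary_file.write_text(summarize(transcript), encoding="utf-8")
    return transcript_file, summary_file
```

Calling `process("talk.mp3", "out")` would run Whisper locally and leave `out/talk.transcript.txt` and `out/talk.summary.txt` behind. Passing the transcriber and summarizer in as functions makes it easy to swap models without touching the file handling.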
Whisper comes from a different company (OpenAI) than YouTube (Google). YouTube's transcription existed before Whisper too, so I'd suspect Google has had its own model for some time.
Whisper is supposed to be better in some cases, but Google's probably works very well at scale.