Doesn't even need to be old movies. Certain types of video content in the US are ...

nostrademons · on March 23, 2016

Much YouTube content has auto-generated subtitles, i.e. Google runs its speech-recognition software on the audio stream and uses the output to caption the video. If you used that as your training set, you'd effectively be training on the output of an AI. That's kind of a clever way to get information from Google into your open-source library, but it will necessarily be lower-fidelity than just using the Google API directly.

_puk · on March 24, 2016

In the US, if it's ever been played out on broadcast TV then it must have Closed Captions.

This is enforced by the FCC [0]. As more and more "internet" content gets consumed, I imagine the same regulations will eventually extend to it, at which point you've got a fantastic training set.

0: https://www.fcc.gov/node/23883