Show HN: Scryer - deep search for video and podcast content

HanClinto · on Feb 15, 2022

This is great! Literally just yesterday I was looking for something like this.

I've used the Chrome extension YouTube Captions Search ( https://chrome.google.com/webstore/detail/youtube-captions-s... ) and that works quite well... but only for one video at a time. I was just thinking that it would be great if there were a public repository of YouTube CC content that we could search through, and lo and behold -- this pops up on my feed today.

Very nicely done with this! Are you using the YouTube-provided CC files as a data source for this, or are you creating your own transcripts from scratch?

prohobo · on Feb 15, 2022

Thanks, actually I'm looking for any feedback I can get on this and then make it as open source as possible (while still earning something) so people can index their own content feeds and use it.

Right now, yes, we're taking the transcripts from Google, just because it's easy. But I'll implement our own transcriber once we start incorporating other sources.

HanClinto · on Feb 15, 2022

Yeah, that makes a lot of sense!

Is there any way that we can contribute? Like, if I watch a video with CC turned on, it feels like it would be easy for me to upload the CC file(s) to a central server, keyed with the ID of the video. Any user watching a video once would be enough to add a video to the database, and we could crowdsource indexing this content to help shoulder some of the load?

prohobo · on Feb 15, 2022

The main bottleneck is the machine learning to extract metadata; the transcriptions are actually handled very fast at an early stage.

Someone mentioned that it might be a good idea to create a federated network where people just link their feeds and share them with each other on demand, which I have no real experience with but I think could work quite well.

I've only been working on this for a few months, so this is all still very experimental and there's no way I can build something like that in a short amount of time, but we'll think of something that makes sense and gets people as much access as possible.

HanClinto · on Feb 16, 2022

What sort of metadata are you extracting with machine learning? Doing more than simple keyword indexing? I'll admit, I don't know very much about the particulars of building a large search engine.

How large is the ML model that you're using to extract the metadata? If that can be done in a distributed way, then it might let people pre-process their own feeds and share them with others -- either on-demand, like you mentioned -- or caching them centrally so that your servers are still the authoritative repository.

Anyways, everything you've built here is wonderful, and I definitely want to keep tabs on your progress and support you however I can! Thank you for sharing!

prohobo · on Feb 22, 2022

Hey sorry I didn't see this reply until now!

We take the transcripts and run natural language processing models on them to find keywords, n-grams, and proper nouns (named entities). We then categorize the proper nouns according to type (person, organization, media, location, event, etc.)
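As a rough illustration of the keyword and n-gram part of that step (not Scryer's actual code), here is a minimal standard-library sketch. The stopword list, the function names, and the sample transcript are all made up for the example, and entity typing is left out because it would need a trained NLP model.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "for", "on", "that", "this", "with"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_keywords(text, k=5):
    """Rank non-stopword tokens by raw frequency."""
    counts = Counter(t for t in tokenize(text) if t not in STOPWORDS)
    return counts.most_common(k)

# Hypothetical transcript snippet, used only to exercise the functions.
transcript = ("Machine learning is fun. Machine learning models need data, "
              "and data needs cleaning.")
tokens = tokenize(transcript)
bigram_counts = Counter(ngrams(tokens, 2))

print(top_keywords(transcript, 3))
print(bigram_counts.most_common(1))
```

Raw frequency is the simplest possible ranking; something like TF-IDF across the whole collection would likely surface more distinctive keywords per video.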

The search engine itself is just transcripts in Elasticsearch.
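For anyone curious what searching transcripts in Elasticsearch can look like, here is a hedged sketch of a request body built with the standard query DSL (`match` plus `highlight`). The `transcript` field name and the helper function are assumptions for illustration, not the actual Scryer mapping.

```python
import json

def build_transcript_query(text, size=10):
    """Build an Elasticsearch search body for a full-text transcript search.

    The "transcript" field name is hypothetical; use whatever the index maps.
    """
    return {
        "size": size,
        # Require all search terms to appear in the transcript.
        "query": {"match": {"transcript": {"query": text, "operator": "and"}}},
        # Ask Elasticsearch to return matching snippets for display.
        "highlight": {"fields": {"transcript": {}}},
    }

body = build_transcript_query("federated network")
print(json.dumps(body, indent=2))
```

The resulting dictionary would be sent as the JSON body of a `_search` request against the transcript index.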

From the feedback I've gotten so far, I think the best thing to do is actually integrate this with YouTube through a browser extension, and have users call the backend to do this processing on-demand (with caching of course). For most people, the video breakdowns seem to matter more than search, since YouTube already lets you do some kind of content searching, just not very transparently.
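The on-demand-with-caching idea could be sketched roughly like this: process a video the first time anyone asks for it, keyed by video ID, and serve the stored result afterwards. The class name, the in-memory dict, and the stand-in `fake_process` function are all assumptions; a real backend would presumably use a shared store and the actual NLP step.

```python
class BreakdownCache:
    """Cache per-video processing results, keyed by video ID."""

    def __init__(self, process):
        self._process = process   # expensive function: transcript -> breakdown
        self._store = {}          # in-memory only; a stand-in for a real cache
        self.misses = 0           # how many times processing actually ran

    def get(self, video_id, transcript):
        """Return the cached breakdown, computing it on the first request."""
        if video_id not in self._store:
            self.misses += 1
            self._store[video_id] = self._process(transcript)
        return self._store[video_id]

def fake_process(transcript):
    """Hypothetical placeholder for the slow metadata extraction."""
    return {"word_count": len(transcript.split())}

cache = BreakdownCache(fake_process)
first = cache.get("abc123", "hello world again")   # computed
second = cache.get("abc123", "hello world again")  # served from the cache
print(first, cache.misses)
```

Keying on the video ID means only the first viewer of a video pays the processing cost.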

prohobo · on Feb 15, 2022

Heya, I'm working on creating a metadata search engine (using transcripts) for YouTube, Spotify, etc. Basically any video/audio content.

Everyone always complains about how bad multimedia search tools are, so this is something that can address that. Some people tried to do stuff like this back in 2006 (Blinkx) but never made it very far. This takes those attempts a step further than basic transcript searching, and uses machine learning to extract topics, entities, and keywords that can be used to filter on the data sets - as well as show "gists" of the content in any piece of media.

It's very much a work in progress. Currently it doesn't have a mobile layout, it's buggy, and it's quite - uh - "overwhelming", but hopefully someone can see the utility of it.

polyterative · on Feb 15, 2022

cool stuff