There are three main parts to note: the video design experience, the media-handling backend, and the download renderer.
The video design experience is a React app backed by a Ruby on Rails API. The React app handles the views, and if you look at the UI, you will see how the app steps are persisted in the URL. Our React app is based on Redux Toolkit, which is phenomenal. The Rails application is a normal API with Sidekiq workers (backed by Redis), which handle asynchronous tasks.
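To make the URL-persistence idea concrete, here is a minimal sketch of how a Redux Toolkit slice could mirror the current step into the query string. The slice name, `setStep`, and the `step` parameter are all illustrative, not our actual code:

```javascript
// Hypothetical sketch: keep the editor's current step in Redux Toolkit state
// and mirror it into the URL so a refresh or shared link restores the flow.
import { createSlice, configureStore } from "@reduxjs/toolkit";

const editorSlice = createSlice({
  name: "editor",
  // Initialize from the URL if a step is already present.
  initialState: {
    step: new URLSearchParams(window.location.search).get("step") || "upload",
  },
  reducers: {
    setStep(state, action) {
      state.step = action.payload;
    },
  },
});

export const { setStep } = editorSlice.actions;
export const store = configureStore({ reducer: { editor: editorSlice.reducer } });

// Write the step back into the query string on every state change.
store.subscribe(() => {
  const { step } = store.getState().editor;
  const url = new URL(window.location.href);
  url.searchParams.set("step", step);
  window.history.replaceState(null, "", url);
});
```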
Our UI renders a number of interesting elements, each generated or prepared by its own service-oriented pipeline. The most important is our transcript API, which is from https://www.AssemblyAI.com - in our experience they are the best transcription tool on price, quality, and developer experience. We also have a series of Lambda functions that handle uploaded audio/video file prep, so we can encode files into a unified format and parse out the audio data needed for visualizations like the audio waveform or the animated audio frequency display.
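As a rough illustration of the waveform-prep step (not our actual pipeline), one common technique is to pipe raw PCM out of ffmpeg and reduce it to per-bucket peaks. This sketch assumes ffmpeg is on the PATH; the function name and bucket count are made up:

```javascript
// Hypothetical sketch: extract normalized peak amplitudes from an audio file
// for drawing a waveform, by decoding to mono 16-bit PCM via ffmpeg.
const { spawn } = require("child_process");

function extractWaveformPeaks(inputPath, buckets = 800) {
  return new Promise((resolve, reject) => {
    // Decode to mono, 8 kHz, signed 16-bit little-endian PCM on stdout.
    const ffmpeg = spawn("ffmpeg", [
      "-i", inputPath,
      "-ac", "1", "-ar", "8000",
      "-f", "s16le", "-",
    ]);
    ffmpeg.stderr.resume(); // drain ffmpeg's log output so it never blocks

    const chunks = [];
    ffmpeg.stdout.on("data", (c) => chunks.push(c));
    ffmpeg.on("error", reject);
    ffmpeg.on("close", (code) => {
      if (code !== 0) return reject(new Error(`ffmpeg exited with ${code}`));
      const pcm = Buffer.concat(chunks);
      const samples = Math.floor(pcm.length / 2); // 2 bytes per sample
      const per = Math.max(1, Math.floor(samples / buckets));
      const peaks = [];
      for (let b = 0; b < buckets; b++) {
        let max = 0;
        for (let i = b * per; i < Math.min((b + 1) * per, samples); i++) {
          max = Math.max(max, Math.abs(pcm.readInt16LE(i * 2)));
        }
        peaks.push(max / 32768); // normalize to 0..1
      }
      resolve(peaks);
    });
  });
}
```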
A few interesting tidbits: we use Lambda Layers extensively. We have functions written in Ruby and JavaScript where we move the vendored gems or node_modules into a shared Lambda Layer, and we also use EFS to run Python-based functions whose dependencies are too big for the Lambda deployment package itself.
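For the Node case, the layer mechanism is mostly transparent: a layer containing a `nodejs/node_modules` directory is unpacked under /opt, and Node's module resolution finds it automatically. A hedged sketch (the `sharp` dependency and handler logic are purely illustrative):

```javascript
// Hypothetical sketch: a Node Lambda whose dependencies live in a shared layer.
// The deployment zip ships no node_modules; `sharp` resolves from
// /opt/nodejs/node_modules, which Lambda populates from the attached layer.
const sharp = require("sharp");

exports.handler = async (event) => {
  // Illustrative work only: decode an image from the event and resize it.
  const input = Buffer.from(event.imageBase64, "base64");
  const resized = await sharp(input).resize(320).png().toBuffer();
  return { imageBase64: resized.toString("base64") };
};
```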
Our video rendering is also pretty neat: we use the browser as the rendering surface and batch-process screenshots of each frame of the final output video, using an AWS-based container orchestration process.
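The core loop of that approach can be sketched with Puppeteer plus an ffmpeg stitch pass. This is a minimal illustration, not our production renderer; in particular, `window.seekToFrame` is a stand-in for however the page exposes deterministic seeking:

```javascript
// Hypothetical sketch of browser-as-renderer: step the page through each
// frame, screenshot it, then aggregate the stills into a video with ffmpeg.
const fs = require("fs");
const puppeteer = require("puppeteer");
const { execFileSync } = require("child_process");

async function renderVideo(url, frameCount, fps = 30) {
  fs.mkdirSync("frames", { recursive: true });

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 720 });
  await page.goto(url, { waitUntil: "networkidle0" });

  for (let i = 0; i < frameCount; i++) {
    // Ask the page to draw one specific frame (assumed hook on the page).
    await page.evaluate((frame) => window.seekToFrame(frame), i);
    await page.screenshot({
      path: `frames/frame_${String(i).padStart(5, "0")}.png`,
    });
  }
  await browser.close();

  // Stitch the numbered stills into an H.264 MP4.
  execFileSync("ffmpeg", [
    "-framerate", String(fps),
    "-i", "frames/frame_%05d.png",
    "-c:v", "libx264", "-pix_fmt", "yuv420p",
    "out.mp4",
  ]);
}
```

In production the frame ranges would be sharded across containers, which is where the orchestration layer comes in.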
In summary, this tool builds on all the work we've been doing for our larger company. Since we can leverage that foundation, we are spinning up a number of single-purpose utility projects based on what customers ask for.
If this is interesting at all, we are hiring advanced JavaScript engineers who are comfortable learning new things.
You can reach me at lenny@milkvideo.com, or if you just want to chat to learn more or ask questions, please feel free to book time here: https://calendly.com/rememberlenny/15-min
I was primarily curious about the video generation system. There are a number of different ways to generate programmatic video, ranging from pure ffmpeg with image and text layers, to some kind of headless browser that spits out frames rendered to a canvas plus an ffmpeg (or equivalent) pass to aggregate them into a video. There are also Python libraries like moviepy that offer their own API for composing layers. Each of these has its own performance characteristics, and I was curious whether there is a de facto best approach for this sort of thing - whether someone has evaluated all of these options and settled on one after weighing the tradeoffs.
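For concreteness, the first of those approaches (pure ffmpeg with layers) might look something like this - a minimal sketch assuming ffmpeg is on the PATH and built with the drawtext filter; the file names and text are placeholders:

```javascript
// Hypothetical sketch: composite an image layer and a text layer onto a
// background video in a single ffmpeg pass, invoked from Node.
const { execFileSync } = require("child_process");

execFileSync("ffmpeg", [
  "-i", "background.mp4",
  "-i", "logo.png",
  "-filter_complex",
  // Overlay the logo at (20,20), then burn in a caption near the bottom.
  "[0:v][1:v]overlay=20:20,drawtext=text='Hello world':fontsize=48:fontcolor=white:x=40:y=h-100",
  "-c:a", "copy",
  "out.mp4",
]);
```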