OpenAI announced from day 1 that GPT-4 is multimodal, so it was mostly a matter of waiting for the safety censorship to be done and for enough GPUs to be available for a mass rollout.
This won't entirely replace human volunteers, but these models get rapidly better over time. What you are seeing today is a mere toy compared to the multimodals you'll get in the future.
Currently there's no model trained on videos, due to the sheer size of video data, but in the future there will be video-capable models, which means they can understand and interpret motion and physics. Put that in smart glasses, and it can act as live eyes to navigate a busy street. Granted, it will take years to bring the costs down enough to make that viable.
We had enough drama with BeMyAI refusing to recognize faces (including faces of famous people) as it were. If sighted people have the right to access porn and sexting, why shouldn't we? Who should dictate what content is "appropriate", and what about cultures with different opinions on the subject?
> Who should dictate what content is "appropriate", and what about cultures with different opinions on the subject?
OpenAI should dictate that, because GPT-4 belongs to them. So they decide what kind of service they're interested in offering.
There will be plenty of other powerful LLMs that can be used in the near future. Some will be more restrictive, some will be less. If you want fewer restrictions, you will be able to pick one that offers that for you.
> There will be plenty of other powerful LLMs that can be used in the near future.
Extremely optimistic take. What tends to happen is centralisation, and regulatory capture ensures the largest players get to dictate what is acceptable, at the expense of anyone who wishes to do things differently.
I mean in theory you can go set up your own social network or video sharing site with whatever rules you like, but you should assume government regulators and big tech will come after you if you do so, whether you believe in the principles of free speech or simply wish to create a safe space for conspiracy theorists.
>> in the future there will be video-capable models, which means they can understand and interpret motion and physics.
Videos may not suffice. Videos are 2d, with 3d aspects being inferred from that 2d data, which is an issue for autonomous driving based on cameras. Better training data would be 3d scans rather than videos, and the best data set would be a combination of video and 3d scanning. Self-driving cars, which combine video with radar/laser scanning, may one day provide such a data set.
There is talk of a 3d version of Google streetview, one using a pair of cameras to allow true VR viewing. That might also be good training data as it will capture, in 3d, many street scenes as they unfold.
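To make the "3d from a pair of cameras" idea concrete, here is a minimal sketch of recovering depth from a rectified stereo pair with OpenCV's block matcher. The file names and calibration numbers are placeholders I made up for illustration, not values from any real rig:

```python
# Sketch: depth from a rectified stereo image pair using OpenCV block matching.
# left.png / right.png and the calibration values below are hypothetical.
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block-matching stereo: disparity between the two views is inversely proportional to depth.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # fixed-point -> pixels

# With focal length f (pixels) and camera baseline B (metres), depth = f * B / disparity.
f, B = 700.0, 0.12  # placeholder calibration
depth = np.where(disparity > 0, f * B / disparity, 0.0)
print("median estimated depth (m):", np.median(depth[depth > 0]))
```

That inverse relationship between disparity and depth is why two cameras a known distance apart can give you real 3d geometry rather than the inferred 3d you get from a single video frame.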
I’ve actually been fairly impressed with how far monocular depth estimation has come in a few years. I think it should not be relied upon in self-driving cars, because they need closer to 100% accuracy and almost no latency, and the SoTA models are currently too slow to run on every frame. Not only that, but it’s a life-or-death situation where cutting accuracy to save on BoM costs seems ridiculous to me.
But in a higher-acceptable-latency, lower-risk environment like this, I am actually quite bullish on camera-only methods. Video understanding has come a long way.
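For a sense of what camera-only depth looks like in practice, here is a rough sketch using the publicly available MiDaS monocular depth model loaded via torch.hub. The input file name is hypothetical, and the output is relative depth only, not metric distances:

```python
# Sketch: monocular (single-camera) depth estimation with MiDaS via torch.hub.
# Requires torch, timm and opencv-python; "street.jpg" is a hypothetical input frame.
import cv2
import torch

model = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")  # small variant, lower latency
model.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.small_transform

img = cv2.cvtColor(cv2.imread("street.jpg"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    prediction = model(transform(img))
    # Resize the predicted depth map back to the original image resolution.
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2], mode="bicubic", align_corners=False
    ).squeeze().cpu().numpy()

print("relative depth map shape:", depth.shape)  # larger values = closer; relative scale only
```

The catch the previous post points at is exactly this: a single image gives you a relative depth map with no absolute scale, which is fine for describing a scene to someone but not for guaranteeing stopping distances.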