Wait, but we're doing that already, and it works well (Qwen 2.5 VL). And if need be, you can always resort to structured generation to enforce schema conformity.
In the demo, O1 implements an incorrect version of the "squirrel finder" game. The instructions state that the squirrel icon should spawn after three seconds, yet it spawns immediately in the first game (also noted by the presenter).
Yeah, now that you mention it, I see that too. It was clearly meant to spawn after 3 seconds. It seems that on successive attempts it also doesn't quite wait 3 seconds.
I'm kind of curious if they did a little bit of editing on that one. Almost seems like the time it takes for the squirrel to spawn is random.
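For reference, spawn-after-three-seconds is just a check against elapsed time. A minimal sketch of the intended logic (names hypothetical, independent of whatever framework the demo actually used):

```python
import time

class SquirrelSpawner:
    """Spawns the squirrel only once at least `delay_s` seconds have
    elapsed since the game started (hypothetical helper class)."""

    def __init__(self, delay_s=3.0, clock=time.monotonic):
        self.delay_s = delay_s
        self.clock = clock
        self.start = clock()
        self.spawned = False

    def update(self):
        """Call once per frame; returns True on the frame the squirrel spawns."""
        if not self.spawned and self.clock() - self.start >= self.delay_s:
            self.spawned = True
            return True
        return False
```

Injecting the clock keeps the delay testable without real waiting; the behavior in the demo looks as if this elapsed-time check were missing entirely.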
If you don't have to pay for child care because you can just take off the time to pick up your kids from school, you're saving money that your job otherwise forced you to spend.
Being forced to commute at peak time costs more too: on a retail salary, the difference between a peak and an off-peak ticket can mean that the first hour or two of work is essentially pointless, since you're just paying back the cost of getting there in the first place.
Or having to pay a surcharge to visit the dentist, because you can only go on a Saturday since you have to be at work the other days. Flexible working would let you just take the Tuesday off, no problem, and go when it's cheaper.
I'm sure there are lots of other examples that apply to different lifestyles.
Hours with your child aren't fungible. You can't pay the babysitter to go see the dance recital for you if you want to be the parent instead of the babysitter being the parent. All the money in the world isn't going to make up for missing the soccer game where your kid scores the winning goal.
Well, to be pedantic, with all the money in the world you wouldn't be working for Ikea and the problem wouldn't exist, so really that's a problem also solved by money...
Generally, though, higher-paid employees tend to have more sway within a company structure and likely don't need to miss these important events. The win here is that something that was generally true from mid-management up at most companies now extends down through all the ranks.
You can manage things in your life when they occur instead of spending money to displace them or risk losing your job because of them.
Put another way: if you present a worker with the choice between two jobs at the same hourly rate, one with flexible working hours and the other without, which would you expect to be the more likely choice? You could then measure the value of that flexibility by adjusting the hourly rates of the two jobs until the choice flips, which would let you estimate exactly how much "more money" it appears to be "worth."
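The measurement described above is essentially finding an indifference point. A toy sketch, assuming a hypothetical worker model in which flexibility is worth some fixed dollar amount per hour (every name and number here is made up purely for illustration):

```python
def prefers_flexible(flex_rate, rigid_rate, flex_value=2.50):
    """Toy choice model: the worker acts as if flexibility adds a
    fixed dollar value per hour (hypothetical number)."""
    return flex_rate + flex_value > rigid_rate

def indifference_premium(base_rate=15.0, step=0.01, max_premium=20.0):
    """Raise the rigid job's rate until the worker switches; the
    premium at the flip estimates the dollar value of flexibility."""
    premium = 0.0
    while premium < max_premium:
        if not prefers_flexible(base_rate, base_rate + premium):
            return premium
        premium += step
    return None
```

Under this model, the recovered premium approximates `flex_value` (about $2.50/h); with real workers you would vary the offered rates across a survey or natural experiment instead.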
It looks like Runpod currently (checked just now) has "Low" availability of 8x MI300 SXM (8 x $4.89/h), H100 NVL (8 x $4.39/h), and H100 (8 x $4.69/h) nodes, for anyone with some time to kill who wants to give the shootout a try.
You're joking/trolling, right? There are literally tens of thousands of H100s available on gpulist right now; does that mean there's no cloud demand for Nvidia GPUs? (I notice from your comment history that you seem to be some sort of bizarre NVDA stan account, but come on, be serious.)
In Mixtral 8x7B, the 8 means that the model uses Mixture-of-Experts (MoE) layers with 8 experts. The 7B means that if you were to remove 7 of the 8 experts in each layer, you would end up with a 7B model (with exactly the same architecture as Mistral 7B). Therefore, a 1x7B model has 7B params, and an 8x7B model has 1 * 7B + (8-1) * sz_expert params, where sz_expert is the number of params the MoE layers gain per additional expert. Mixtral 8x7B has 46.7B params in total, so sz_expert ≈ (46.7B - 7B) / 7 ≈ 5.7B.
If these assumptions port over to 8x22B, whose fp16 weights are about 281GB (roughly 140B params at 2 bytes each), then sz_expert ≈ (140B - 22B) / 7 ≈ 17B.
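The arithmetic can be written out directly. A small sketch using Mistral's published total parameter counts (46.7B for 8x7B, 141B for 8x22B) rather than on-disk file sizes:

```python
def expert_increment(total_params_b, dense_params_b, n_experts=8):
    """Params (in billions) added per extra expert, assuming
    total = dense + (n_experts - 1) * sz_expert."""
    return (total_params_b - dense_params_b) / (n_experts - 1)

sz_7b = expert_increment(46.7, 7.0)     # Mixtral 8x7B
sz_22b = expert_increment(141.0, 22.0)  # Mixtral 8x22B
print(f"8x7B:  sz_expert ~ {sz_7b:.1f}B")   # ~ 5.7B
print(f"8x22B: sz_expert ~ {sz_22b:.1f}B")  # ~ 17.0B
```

Note this counts only total params; per-token compute depends on how many experts are active (2 of 8 for Mixtral), which is why these models run much faster than dense models of the same size.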
People here seem mostly impressed by the high resolution of these examples.
Based on my experience doing research on Stable Diffusion, scaling up the resolution is the conceptually easy part that only requires larger models and more high-resolution training data.
The hard part is semantic alignment with the prompt. Attempts to scale Stable Diffusion, like SDXL, have resulted only in marginally better prompt understanding (likely due to the continued reliance on CLIP prompt embeddings).
So, the key question here is how well Sora does prompt alignment.
There needs to be an updated CLIP-like model in the open-source community. CLIP itself is almost three years old now and is still the backbone of a lot of multimodal models. It's not a sexy problem to take on, since it isn't especially useful in and of itself, but so many downstream foundation models (LLaVA, etc.) would benefit immensely from it. Is there anything out there that I'm just not aware of, other than SigLIP?
I think one part of the problem is using English (or whatever natural language) for the prompts/training. Too much inherent ambiguity. I’m interested to see what tools (like control nets with SD) are developed to overcome this.