I'm not sure I quite understand what you're aiming at with these questions, but there are certainly techniques in ML based on thinking of functions as vectors. The first one that comes to mind is AnyBoost [1], which views boosting as doing "gradient descent" in function space - where each "gradient step" is not a typical vector (as you'd see e.g. in neural nets) but a function, corresponding in practice to a base classifier. Another very popular one is Gaussian processes - one way to think about them (at least some of them) is as modeling functions as samples from an infinite-dimensional Gaussian.
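To make the function-space view concrete, here's a toy sketch (my own illustration with made-up data, scikit-learn trees and a squared loss - not the AnyBoost paper's algorithm as written): plain least-squares gradient boosting, where each "gradient step" is itself a function, a small tree fit to the negative gradient of the loss at the current function values.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Toy 1-D regression problem (synthetic data, just for illustration).
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

    F = np.zeros_like(y)   # current "point" in function space, evaluated at the training inputs
    lr = 0.1

    for _ in range(100):
        residual = y - F   # negative gradient of 1/2*(y - F)^2 w.r.t. F
        h = DecisionTreeRegressor(max_depth=2).fit(X, residual)   # the "step" is a function
        F += lr * h.predict(X)                                    # move a little in that direction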
"why didn't we think of this sooner?", asks the article. Not sure who the "we" is supposed to be, but the robotics community has definitely thought of this before. https://robo-affordances.github.io/ from 2023 is one pretty relevant example that comes to mind, but I have recollections of similar ideas going back to at least 2016 or so (many of which are cited in the V-JEPA2 paper). If you think data-driven approaches are a good idea for manipulation, then the idea of trying to use Youtube as a source of data (an extremely popular data source in computer vision for the past decade) isn't exactly a huge leap. Of course, the "how" is the hard part, for all sorts of reasons. And the "how" is what makes this paper (and prior research in the area) interesting.
I definitely saw somebody at Actuate last year talking about supplementing VLA training videos with YouTube, but I think they actually found that "any" video of the real world helped give the model a better "understanding" of physics.
Yeah, that was my first thought. And it's not just about them accepting Typst, but also whether they would provide a Typst template, like they currently do for LaTeX. Using the conference/journal template to write the article saves a lot of time for both submitters and editors (who have to deal with hundreds, if not thousands, of submissions).
Like the other commenter said, their retail price is $2000. In the used market, they are going for about $1400-1500 based on reverb.com stats. So at least compared to those prices, $1400 new is a decent deal. And for what it's worth, they sold out pretty quickly today (or so their online store says).
Of course, $1400 is still a hefty price tag, I won't argue with that (and there's not much new to say on that topic, considering it's like 80% of online discourse about TE).
Yeah, the article was painting with a bit too broad a brush IMO, though they did briefly acknowledge "special exceptions" such as satellite or medical imagery. It's very application-dependent.
That said, in my experience beginners often overestimate, for some reason, how much image resolution a given task actually needs. I often find myself asking them to rerun their experiments at a lower resolution. There's a surprising amount of information in 128x128 or even smaller images.
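A quick way to sanity-check this on your own data (just a sketch with Pillow; the file path is a placeholder): downsample a few images, scale them back up for a side-by-side look, and see whether the task still seems doable by eye.

    from PIL import Image

    img = Image.open("example.jpg")                     # placeholder path
    small = img.resize((128, 128))                      # downsample to 128x128
    small.resize(img.size).save("example_128_up.jpg")   # upsample back for comparison at original size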
I have a vivid memory of playing Rise of the Triad[1] against my buddy over a serial cable. Like most PC games from back then, it used mode 13h[2], so 320x200 resolution with a 256-color palette.
I have the distinct memory of firing a rocket at him from far away because I thought that one pixel had the wrong color, and killing him to his great frustration. Good times.
You can play the shareware portion of the game here[3] to get an idea.
There's been a huge amount of work on image transformers since the original ViT. A lot of it has explored different schemes for slicing the image into tokens, and I've definitely seen some that use a multiresolution pyramid. Not sure about the RL part - after all, the higher (lower-resolution) levels of the pyramid would add fewer tokens than the base (high-resolution) level, so it doesn't seem that necessary. But given the sheer volume of work out there, I'd bet someone has already explored this idea, or something pretty close to it.
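Rough token counts for a hypothetical pyramid (224px base, 16px patches - numbers are mine, not from any specific paper) show why the coarser levels are cheap:

    patch = 16
    for side in (224, 112, 56):              # base level plus two coarser pyramid levels
        n_tokens = (side // patch) ** 2
        print(f"{side}x{side}: {n_tokens} tokens")
    # -> 224x224: 196 tokens, 112x112: 49 tokens, 56x56: 9 tokens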
Slicing up images to analyze them is definitely something people do - in many cases, such as satellite imagery, there isn't much of an alternative. But it should be done mindfully, especially if there are differences between how it's done at training time and at test time. Depending on the architecture and the application, it's not the same as processing the whole image at once. Some differences are more or less obvious (for example, you might get border artifacts), but others are more subtle. For example, despite the translation equivariance you'd expect from convolutional nets, they can implicitly encode positional information based on where they see border padding during training. And for some types of normalization, such as instance normalization, the statistics may differ significantly depending on whether the normalization is applied to individual patches or to the whole image.
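A tiny synthetic example of the normalization point (my own setup, not tied to any particular network): instance-normalizing a patch with its own statistics gives different values than normalizing the whole image and then cropping, even though the pixels are the same.

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic "image" with a brightness gradient across its width.
    img = rng.normal(size=(256, 256)) + np.linspace(0, 3, 256)[None, :]

    def instance_norm(x, eps=1e-5):
        return (x - x.mean()) / (x.std() + eps)

    whole = instance_norm(img)[:, :128]   # normalize the whole image, then crop
    patch = instance_norm(img[:, :128])   # normalize the crop with its own stats
    print(np.abs(whole - patch).max())    # clearly nonzero: same pixels, different values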
$8000 also seems pretty cheap for 2PB of traffic? Looking at Google Cloud Storage egress rates, $0.02/GiB (which is on the lower end, since it depends on the destination) would come out to about $40k for 2PB.
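Rough arithmetic behind that figure (decimal vs. binary reading of "2PB"; actual rates vary by destination and tier):

    rate = 0.02                            # $/GiB, the figure quoted above
    decimal_gib = 2 * 1000**5 / 1024**3    # 2 PB  -> ~1.86 million GiB
    binary_gib = 2 * 1024**2               # 2 PiB -> 2,097,152 GiB
    print(f"${decimal_gib * rate:,.0f}")   # ~$37,000
    print(f"${binary_gib * rate:,.0f}")    # ~$42,000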
Honestly, I think that if you had asked me in 2020 whether we would be able to do this by 2025, I would've guessed no, with fairly high confidence. And I was aware of GPT back then.
[1] https://proceedings.neurips.cc/paper_files/paper/1999/file/9...