We use GPT-4V to reason about the screen and decide what to do next. It does make mistakes. Here's a video of it mistaking a page in the shop app for an ad (https://www.youtube.com/watch?v=MKyO-U7j4Hs).

The upside is that we do some prompt hacking on our end so it can break out of loops and recover after it has made a mistake. That said, we're working on improving this!
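To give a rough idea of what breaking out of a loop can look like, here's a minimal sketch. It is not our actual implementation: `is_stuck`, `build_prompt`, the `(screen, action)` history format, and the hint wording are all made up for illustration. The idea is just to notice when the same action keeps repeating on the same screen and tell the model so in the next prompt.

```python
def is_stuck(history, window=3):
    """Return True if the last `window` (screen, action) steps are identical.

    `history` is a hypothetical list of (screen_id, action) tuples, one per
    step the agent has taken so far.
    """
    if len(history) < window:
        return False
    recent = history[-window:]
    return all(step == recent[0] for step in recent)


def build_prompt(base_prompt, history, window=3):
    """Append a recovery hint to the prompt when the agent looks stuck."""
    if is_stuck(history, window):
        _, action = history[-1]
        hint = (
            f"\nNote: the action '{action}' was tried {window} times on this "
            "screen without progress. Choose a different action."
        )
        return base_prompt + hint
    return base_prompt
```

For example, `build_prompt("Decide the next step.", [("home", "tap_close")] * 3)` returns the base prompt with the hint appended, while a shorter or more varied history leaves the prompt unchanged.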

On costs, it's cheaper than you think. The entire playground demo cost us less than $10. That's more expensive than running a script, but we believe the cost of intelligence will go down over time.

On speed, yes, it is slow. We mitigate this by parallelizing tests across devices on our device farm. We can normally turn results around in 2.5-4 hours, depending on the number of tests.
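For anyone curious what the parallelization amounts to, here's a simplified sketch, assuming tests are independent and each device runs its share one at a time. `run_test` is a placeholder and `run_suite` is a hypothetical name, not our real device-farm code. Tests are split round-robin across devices and each device's shard runs in its own thread.

```python
from concurrent.futures import ThreadPoolExecutor


def run_test(device, test):
    # Placeholder: a real runner would drive the device and return a verdict.
    return f"{test} on {device}: ok"


def run_suite(devices, tests):
    """Split tests round-robin across devices and run the shards in parallel."""
    # Device i gets tests i, i + n, i + 2n, ... where n is the device count.
    shards = {d: tests[i::len(devices)] for i, d in enumerate(devices)}

    def run_shard(device):
        # Each device works through its own shard sequentially.
        return [run_test(device, t) for t in shards[device]]

    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        per_device = pool.map(run_shard, devices)
        # Flatten the per-device result lists into one list.
        return [result for shard in per_device for result in shard]
```

With this layout, wall-clock time is roughly the longest shard rather than the sum of all tests, which is why turnaround depends on the number of tests.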

Thanks for the questions!