
So there’s a very big difference between the sort of vision approach that browser-use takes and what we do.

browser-use is still strongly coupled to the DOM for interaction because of the set-of-marks approach it uses (for context, those are the little rainbow boxes you see around the elements). This makes it very difficult to get it to reliably perform interactions beyond straightforward click/type, such as drag and drop or interacting with a canvas.

Since we interact based purely on what we see on the screen using pixel coordinates, those sorts of interactions are a lot more natural to us and perform much more reliably. If you don't believe me, I encourage you to try to get both Magnitude and browser-use to drag and drop cards on a Kanban board :)
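
To make the pixel-coordinate idea concrete, here is a rough sketch of how a drag could be planned as raw mouse events with no DOM lookup at all. This is not Magnitude's actual implementation; `MouseEvent` and `plan_drag` are hypothetical names invented for illustration, and the coordinates are assumed to come from whatever the vision model reports.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class MouseEvent:
    """Hypothetical low-level mouse event in screen pixel coordinates."""
    kind: str  # "move", "down", or "up"
    x: int
    y: int


def plan_drag(start: Point, end: Point, steps: int = 5) -> List[MouseEvent]:
    """Plan a drag from start to end as a list of raw mouse events.

    The pointer moves to the start, presses, travels along a straight line
    in `steps` increments, and releases at the end. No selectors are used.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    (x0, y0), (x1, y1) = start, end
    events = [MouseEvent("move", x0, y0), MouseEvent("down", x0, y0)]
    for i in range(1, steps + 1):
        t = i / steps  # fraction of the path covered so far
        events.append(
            MouseEvent("move", round(x0 + (x1 - x0) * t), round(y0 + (y1 - y0) * t))
        )
    events.append(MouseEvent("up", x1, y1))
    return events


# Example: drag a card 200 px to the right in 4 intermediate moves.
for event in plan_drag((100, 200), (300, 200), steps=4):
    print(event)
```

With a browser driver such as Playwright, each planned event would presumably map onto a mouse move, press, or release call. The intermediate moves matter because many drag-and-drop widgets only react to a stream of move events, not a single jump.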

Regardless, best of luck!



In our experience, DOM-based interaction is more repeatable and performant than vision / x,y-based interaction, but each has its tradeoffs. As you said, click-and-drag is harder when the source and target aren't classic DOM elements (e.g. canvas). We'll likely add x,y-based interaction as a fallback method at some point.
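
A DOM-first strategy with an x,y fallback could look roughly like the sketch below. It is an illustration only, not browser-use's API: `click`, `ElementNotFound`, and the two callbacks `dom_click` and `xy_click` are hypothetical stand-ins for whatever the underlying browser driver provides.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]


class ElementNotFound(Exception):
    """Raised by the hypothetical DOM driver when a selector matches nothing."""


def click(
    selector: Optional[str],
    point: Optional[Point],
    dom_click: Callable[[str], None],
    xy_click: Callable[[int, int], None],
) -> str:
    """Click via the DOM when possible, otherwise via pixel coordinates.

    Returns "dom" or "xy" to report which strategy was actually used.
    """
    if selector is not None:
        try:
            dom_click(selector)
            return "dom"
        except ElementNotFound:
            pass  # e.g. the target lives inside a canvas; try coordinates
    if point is None:
        raise ElementNotFound(
            f"no DOM match for {selector!r} and no coordinates given"
        )
    xy_click(*point)
    return "xy"


# Example with fake drivers: the selector misses, so the x,y path is taken.
def fake_dom_click(selector: str) -> None:
    raise ElementNotFound(selector)


def fake_xy_click(x: int, y: int) -> None:
    print(f"clicked at ({x}, {y})")


print(click("#card-1", (120, 340), fake_dom_click, fake_xy_click))
```

Returning the strategy name keeps the repeatable DOM path as the default while making it easy to log how often the coordinate fallback is needed.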



