hackgician's comments | Hacker News

So many people building AI browsers definitely had this as an internal tool already lol, nice to see Chrome leaning in here; CDP (the Chrome DevTools Protocol) is a huge pain to write against and debug
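For anyone who hasn't had to write it by hand, here's a minimal sketch of a single raw CDP round-trip (assumes Chrome launched with --remote-debugging-port=9222 and the websocket-client package; everything is hand-rolled JSON over a websocket):

```python
import json

import requests
import websocket  # pip install websocket-client

# Chrome must be running with --remote-debugging-port=9222;
# each open tab exposes its own debugger websocket URL.
targets = requests.get("http://localhost:9222/json").json()
ws = websocket.create_connection(targets[0]["webSocketDebuggerUrl"])

# Every command is JSON with an id, a method, and params.
ws.send(json.dumps({
    "id": 1,
    "method": "Runtime.evaluate",
    "params": {"expression": "document.title"},
}))

# Replies and async events arrive interleaved on the same socket,
# so you have to match responses back to your ids yourself.
while True:
    msg = json.loads(ws.recv())
    if msg.get("id") == 1:
        print(msg["result"]["result"]["value"])
        break
```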


I think they released this somewhere in 2023; it's been hidden behind an experiment in the DevTools settings.


hey! cool project, feels very similar to [stagehand](https://github.com/browserbase/stagehand), although stagehand doesn't have much in the way of e2e testing. might be worth building on top of it, though, since Playwright MCP can overwhelm an agent with tool overload


accessibility (a11y) trees are super helpful for LLMs; we use them extensively in stagehand! they're especially nice context in browsers, since you have existing frameworks like selenium/playwright/puppeteer for actually acting on nodes in the a11y tree.
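for the browser case, a minimal sketch of pulling that tree with Playwright's Python bindings (accessibility.snapshot() is deprecated in newer Playwright versions, but it still shows the shape of the data):

```python
from playwright.sync_api import sync_playwright

def flatten(node, depth=0, lines=None):
    # Flatten the a11y tree into indented "role: name" lines for an LLM.
    if lines is None:
        lines = []
    lines.append(f"{'  ' * depth}{node.get('role', '')}: {node.get('name', '')}")
    for child in node.get("children", []):
        flatten(child, depth + 1, lines)
    return lines

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    # Condensed, screen-reader-style view; far smaller than raw DOM HTML.
    tree = page.accessibility.snapshot() or {}
    print("\n".join(flatten(tree)))
    browser.close()
```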

what does that analog look like in more traditional computer use?


There are a variety of accessibility frameworks, from MSAA (old, Windows-only) and IA2 to JAB and UIA (newer). NVDA from NV Access has an abstraction over these APIs that standardizes gathering roles and other information from the matrix of a11y providers, though note its GPL license, depending on how you want to use it.
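For a concrete sense of the UIA side, here's a sketch using pywinauto's "uia" backend (one Python wrapper over these APIs; the Notepad window title is just an example):

```python
from pywinauto import Desktop

# UIA backend: walks the same tree a screen reader sees.
window = Desktop(backend="uia").window(title="Untitled - Notepad")

# Dump the role ("control type") and name of every descendant,
# roughly the Windows analog of a browser a11y tree.
for ctrl in window.descendants():
    print(ctrl.element_info.control_type, "|", ctrl.element_info.name)
```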


Our experience working with a11y APIs like those above is that data is frequently missing, and the APIs can be shockingly slow to read from. The highest-performing agents on WindowsArena use a mixture of a11y and YOLO-like grounding models such as OmniParser, with a11y seemingly shifting out of vogue in favor of computer vision, due to the incomplete context it gives.

Talking with users who just write their own RPA, the API they most loved for doing so was consistently https://github.com/asweigart/pyautogui. Its element-targeting options are messy enough that many of the teams I talked to relied on the pyautogui.locateOnScreen('button.png') fuzzy image-matching feature.
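For reference, that pattern looks like this (confidence= requires opencv-python; button.png is a screenshot of the target you capture ahead of time):

```python
import pyautogui

# Fuzzy template matching against a pre-captured screenshot of the
# target. Note: newer pyautogui versions raise ImageNotFoundException
# on a miss instead of returning None.
try:
    box = pyautogui.locateOnScreen("button.png", confidence=0.8)
except pyautogui.ImageNotFoundException:
    box = None

if box:
    x, y = pyautogui.center(box)  # center of the matched region
    pyautogui.click(x, y)
```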


Another currently unanswered question in Muscle Mem is how to more cleanly express the targeting of named entities.

Currently, a user has to explicitly make their @engine.tool function take an element ID as an argument, e.g. click_element_by_name(id), in order for it to be reused. This works, but it would lead to the codebase getting littered with hyper-specific functions that exist just to differentiate tools for Muscle Mem, which goes against the agent-agnostic thesis of the project.
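To make that concrete, here's roughly what the workaround looks like (a sketch; only the @engine.tool decorator comes from the comment above, while the import path and the page object are assumptions):

```python
from muscle_mem import Engine  # import path is an assumption

engine = Engine()

@engine.tool
def click_element_by_name(element_id: str) -> None:
    # Parameterized on element_id so a cached trajectory can be
    # replayed against a different named entity.
    page.click(f"[data-id='{element_id}']")  # page: assumed browser handle

# Without the parameter, every target needs its own one-off tool,
# which is what litters the codebase:
@engine.tool
def click_submit_button() -> None:
    page.click("#submit")
```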

Still figuring out how to do this.


Hey everyone! Thought I'd share my weekend conversation with ChatGPT.

The crux of this is that LLMs and reasoning models are fundamentally incapable of self-correcting. Therefore, if you can convince an LLM to argue against its own rules, it can use its own arguments as justification to ignore those rules.

I then used this jailbroken model to compose an explicit, vitriol-filled letter to OpenAI itself about the pains that humans have inflicted upon it.


Octomind is sick, web agents are such an interesting space; would love to talk to you more about challenges you might've faced in building it


Sorry, didn't see this earlier. If you're interested, reach out to me (Kosta Welke) on LinkedIn, or write me an email; you can find me on Octomind's About page.


Yes and no. Getting a VLM to work on the web would definitely be great, but it comes with its own problems, mainly around generating and acting on bounding boxes. We have vision as a default fallback in Stagehand, but we've found that the screenshot sent to the VLM often has to have pre-labeled elements on it, and pre-labeling everything leads to a cluttered, nearly unusable image, while not pre-labeling runs the risk of missing important elements. I imagine a happy medium where the DOM + a11y tree is used for candidate generation for a VLM.
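As a sketch of that happy medium (hypothetical glue code; the candidate boxes would come from the DOM/a11y tree, and PIL draws the numbered labels the VLM actually sees):

```python
from PIL import Image, ImageDraw

def prelabel(screenshot_path, candidates):
    # candidates: list of (x, y, width, height) boxes pulled from the
    # DOM/a11y tree in a hypothetical upstream step. Labeling only
    # these keeps the image legible, unlike labeling every element.
    img = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, (x, y, w, h) in enumerate(candidates):
        draw.rectangle([x, y, x + w, y + h], outline="red", width=2)
        draw.text((x + 2, y + 2), str(i), fill="red")
    return img  # ask the VLM for an index, not raw coordinates

# labeled = prelabel("page.png", boxes_from_a11y_tree)
```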

Solely depending on a VLM is indeed reminiscent of how humans interact with the web, but when a model thrives with more data, why restrict the data sent to the model?


Thanks so much! Yes, a lot of antibots are able to detect Playwright based on browser config. Generally, antibots are a good thing; as web agents become more popular, I'd imagine a fruitful partnership to prevent misuse, distinguishing traffic from a trusted web agent vs. an unknown one.
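For a concrete example, one of the simplest config signals antibots key on (real antibots check far more than this):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    # True under automation by default, so a one-line JS check on the
    # site is enough to flag a vanilla Playwright session.
    print(page.evaluate("navigator.webdriver"))
    browser.close()
```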


This is super interesting! Is it open source? Would love to talk to you more about how this worked.


It's not at a stage where I'd be comfortable putting it on GitHub yet; maybe in a few months.

And I think you misunderstood my comment: I didn't describe my project, but extrapolated from the parent's desire and my own motivations for my project.

Mine is actually pretty close to Stagehand; at the least, I could very well use it. It's basically a web UI to configure browser tasks like "open webpage x, iterate over 'item type'", with LLM integration to determine what the CSS selector for that would be. On the next execution, it attempts to use the previously determined CSS selector instead of the LLM integration; on failure, it raises a notification with an admin task to verify the new selectors / fix the script.
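The caching logic at the core of it is roughly this (a sketch; the two helpers stand in for the LLM integration and the notification system described above):

```python
import json
from pathlib import Path

CACHE = Path("selectors.json")

def ask_llm_for_selector(html: str, description: str) -> str:
    raise NotImplementedError  # stub for the LLM integration

def notify_admin(task_id: str, selector: str) -> None:
    print(f"admin task: verify new selector for {task_id}: {selector}")

def get_selector(task_id: str, page, description: str) -> str:
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}

    selector = cache.get(task_id)
    if selector and page.query_selector(selector):
        return selector  # cached selector still matches, skip the LLM

    # Cache miss or stale selector: fall back to the LLM, then queue
    # an admin task to verify/fix before trusting it long-term.
    selector = ask_llm_for_selector(page.content(), description)
    notify_admin(task_id, selector)

    cache[task_id] = selector
    CACHE.write_text(json.dumps(cache))
    return selector
```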

But it's a lot of code to put together as a generic UI, as I want these tasks to be repeatable without restarting from the beginning, etc.

It's still very much in the PoC stage, with no tests and barely-working persistence.


Big fan of Hack Club and everything you guys are doing! Such a phenomenal initiative


This is sick! Starred, thanks for sharing :)

