
This has been the most interesting design problem we've tackled in a while, so we wanted to share it.

The problem: when you deploy an agent to multiple users, every user gets the same tools with the same permissions. There's no layer between "the agent can see this tool" and "the agent can execute this tool with any parameters and the user sees the full response." For a single-user demo that's fine. For 200 users across five teams hitting SAP, Snowflake, and internal APIs, it's a non-starter.

We added three webhook-based hooks to the tool execution pipeline. What made it interesting:

The Access hook runs before the LLM even receives the tool list. If a user can't access a tool, the model never knows it exists. We added batch support here so a single webhook call evaluates the entire catalog (no N+1 when you have hundreds of tools), with TTL-based caching so you're not adding latency on every chat turn.
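A sketch of what the batch evaluation plus caller-side TTL cache could look like. The payload shape, the `FINANCE_USERS` policy, and the 300-second TTL are all illustrative stand-ins, not Arcade's actual schema:

```python
import time

FINANCE_USERS = {"u_42"}  # stand-in policy data

def evaluate_access(user_id, tool_names):
    """One batch webhook call decides visibility for the whole catalog
    (no N+1 when there are hundreds of tools)."""
    allowed = [t for t in tool_names
               if not t.startswith("SAP_") or user_id in FINANCE_USERS]
    return {"user_id": user_id, "allowed_tools": allowed}

_cache = {}          # user_id -> (expires_at, allowed_tools)
TTL_SECONDS = 300    # illustrative TTL

def allowed_tools(user_id, tool_names):
    """TTL cache so the webhook isn't hit on every chat turn."""
    now = time.time()
    hit = _cache.get(user_id)
    if hit and hit[0] > now:
        return hit[1]
    result = evaluate_access(user_id, tool_names)["allowed_tools"]
    _cache[user_id] = (now + TTL_SECONDS, result)
    return result
```

Tools filtered out here never reach the model's tool list at all.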

The Pre-Execution hook is where it gets fun. It's not just allow/deny. Your webhook can modify the inputs before the tool fires. We're using this to inject per-user compliance parameters into Snowflake queries, map human-readable IDs to internal UUIDs, and route requests to region-specific backends. The agent doesn't need to know about any of that.
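As a rough sketch of that input rewriting, here is what a pre-execution webhook body might do. The payload/response shapes, tool name, and `REGION_BY_USER` mapping are hypothetical, for illustration only:

```python
REGION_BY_USER = {"u_42": "eu-west-1"}  # stand-in routing table

def pre_execution_hook(payload):
    """Hypothetical pre-execution webhook: allow/deny plus input rewriting.
    The agent never sees the injected parameters."""
    tool = payload["tool"]
    inputs = dict(payload["inputs"])
    user = payload["user_id"]
    if tool == "Snowflake_RunQuery":
        # Inject a per-user compliance parameter.
        inputs["query_tag"] = f"user:{user}"
    # Route to a region-specific backend based on the caller.
    inputs["region"] = REGION_BY_USER.get(user, "us-east-1")
    return {"decision": "allow", "inputs": inputs}
```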

The Post-Execution hook was originally just for PII redaction, but the more interesting use case turned out to be prompt injection defense. A scraping tool returns content with embedded hijack instructions? The payload gets stripped before the LLM ever sees it. This protects in both directions: the tools from the agent, and the agent from the tools.

The design decision we debated the most: hooks execute as a pipeline. Multiple hooks on the same hook point chain together, each seeing the accumulated output of the previous one. Org-level hooks run first (company-wide compliance), then project-level hooks (team-specific rules), then org-level hooks again (final PII sweep, audit logging). If any hook in the chain denies, execution stops immediately.

Each hook has its own failure mode: fail_closed or fail_open. A PII scanner going down shouldn't have the same blast radius as a telemetry collector.
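The chaining and per-hook failure modes can be sketched together like this (hook descriptors and payload shapes are assumptions for illustration, not the real runtime):

```python
def run_pipeline(hooks, payload):
    """Hooks chain in order (org -> project -> org); each sees the
    accumulated inputs of the previous one. Any deny stops immediately."""
    inputs = dict(payload["inputs"])
    for hook in hooks:
        try:
            result = hook["fn"]({**payload, "inputs": inputs})
        except Exception:
            # Per-hook blast radius: a down PII scanner (fail_closed)
            # blocks the call; a down telemetry collector (fail_open)
            # is simply skipped.
            if hook.get("fail_mode", "fail_closed") == "fail_closed":
                return {"decision": "deny",
                        "reason": f"{hook['name']} unavailable"}
            continue
        if result["decision"] == "deny":
            return result
        inputs = result.get("inputs", inputs)
    return {"decision": "allow", "inputs": inputs}
```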

There's also a dry-run mode that lets you test new hooks against live production traffic without enforcing anything. You see exactly what would have been blocked, modified, or flagged before you flip it on.

Everything's via webhooks, so you implement your logic in whatever language, on your infrastructure.

Happy to answer questions about the architecture. Docs: https://docs.arcade.dev/guides/contextual-access


We work in AI infrastructure and wrote this to reason about current inference pricing from a systems and economics perspective.

The core claim is that inference is often priced below marginal cost today to drive adoption, which creates incentives that are rational short-term but risky to assume are permanent.

Not arguing AI is a bubble – arguing that application builders should treat current economics as a temporary advantage, not a baseline.


If you’ve used MCP tools in Cursor or VS Code, you’ve probably noticed that each client ends up with its own configuration, auth setup, and failure modes.

We built a way to expose a set of MCP tools behind a single endpoint, so the same configuration works across Cursor, VS Code, and other MCP clients.

The endpoint handles tool routing and user-specific auth at runtime — each user auths directly with their own credentials (no shared tokens or passthrough auth). That means teammates can reuse the same setup without reconfiguring clients or leaking access.

The post walks through a concrete developer workflow (Linear → GitHub → Slack) and how having a portable, multi-client setup changes day-to-day agent usage.

Curious how others here are handling MCP portability today.


There’s a lot of discussion about “agent skills” vs “tools.”

After working on agents that had to ship, our conclusion is that the distinction matters architecturally — but not in the way most debates frame it.

From the model’s perspective, everything collapses into a description and an invocation surface. The real failure modes show up elsewhere: token budgets, interface design, and authentication models that don’t survive multi-user systems.

We wrote this up with concrete examples and would appreciate critique.


Anthropic recently introduced Tool Search, which is supposed to let models discover which tool to call without loading thousands of tool definitions into context.

We wanted to understand how this behaves under something closer to a real production environment, not a demo with 10–20 tools. So we loaded 4,027 tools (Google Workspace, Slack, GitHub, Salesforce, ClickUp, HubSpot, etc.) and ran 25 simple, unambiguous tasks.

We didn’t test tool calling — only retrieval. The question: Does the correct tool appear in the top-K results?

Some categories performed extremely well. Others failed on very basic cases (“send an email”, “post to Slack”). The unevenness was more significant than we expected.

This isn’t a critique of Anthropic — the underlying architecture makes sense — but the empirical behavior is worth sharing because tool discovery is becoming a foundational layer for agent reliability.

Full write-up + raw logs: https://blog.arcade.dev/anthropic-tool-search-4000-tools-tes...

Curious whether others have run large-scale tool retrieval tests or built custom search/ranking layers for agents.


Anthropic recently introduced Tool Search, which lets Claude look up tools dynamically instead of loading everything into the context window. It’s a promising idea, especially if you’re working with large tool catalogs.

We maintain ~4,000 agent-optimized tools at Arcade.dev (Gmail, Slack, GitHub, Drive, Zendesk, Salesforce, etc.), so we ran a simple evaluation: 25 straightforward tasks that normally achieve ~100% retrieval accuracy when the toolset is small (<50 tools).

Examples:

“Send an email to my colleague…” → expect Gmail_SendEmail

“Post a message to #general…” → expect Slack_SendMessage

“Schedule a meeting…” → expect GoogleCalendar_CreateEvent

We tested both built-in search modes (regex + BM25) and counted a task as correct only if the expected tool appeared in the top-K returned references (K=5). We did not test whether Claude then chose the tool correctly or generated good parameters — just the retrieval.
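The scoring rule is simple enough to sketch (the data shapes here are hypothetical; see the write-up for the actual harness):

```python
def retrieval_accuracy(results, expected, k=5):
    """Score a retrieval run: a task counts as correct only if the
    expected tool appears in the top-K returned references."""
    hits = sum(1 for task, returned in results.items()
               if expected[task] in returned[:k])
    return hits / len(results)
```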

Results:

- Regex search: 56% (14/25)

- BM25 search: 64% (16/25)

Some tools were consistently found (Calendar, GitHub, Docs, Spotify, Salesforce). Others — including the most common ones — were not retrieved reliably (Gmail_SendEmail, Slack_SendMessage, Zendesk_CreateTicket, ClickUp_CreateTask, HubSpot_CreateContact, etc.).

Our read: Tool Search is architecturally the right direction (defer tool loading, avoid context bloat, JIT retrieval). But at ~60% retrieval accuracy across a large catalog, it’s not production-ready for agents that must reliably choose the right tool and parameters.

Results + full data + source code are here: https://blog.arcade.dev/anthropic-tool-search-4000-tools-tes...

Curious whether others have tried Tool Search at scale or observed similar behavior.


Anyone who's tried to build a serious AI agent has run into the same blocker: external authorization.

Not “is this client allowed to talk to the MCP server?” That’s a different problem.

I mean the harder part: How does an agent securely obtain user-level OAuth credentials for systems like Gmail, Slack, Outlook, Jira, Salesforce, etc.?

Until now, MCP had no standard way to do this safely. Every real agent demo quietly relied on one of these hacks:

→ hardcoded service accounts

→ bot tokens that don’t match user permissions

→ tokens injected directly into the LLM (a non-starter)

→ custom glue code around device codes or one-off redirects

All of that works for a demo. None of it survives a security review.

The latest MCP spec update introduces URL Elicitation, which finally defines a standard, secure way for agents to run OAuth flows for external systems — without exposing tokens to the model, the client, or the editor runtime.

The workflow:

1. the agent realizes it needs access to an external system

2. it triggers a browser-based OAuth flow

3. the user authenticates directly with the external provider

4. tokens stay inside a trusted boundary

5. the LLM never touches credentials
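From the server side, step 2 is a request asking the client to open a browser. A rough sketch of that handshake — the `mode`/`url` field names are illustrative, not the exact SEP wire format, and the URL is a placeholder:

```python
def needs_auth_response(provider, auth_url):
    """Sketch of a URL-elicitation request: the server asks the client
    to send the user to a provider-hosted OAuth consent page.
    Field names are illustrative, not the exact spec schema."""
    return {
        "method": "elicitation/create",   # server-initiated elicitation
        "params": {
            "mode": "url",                # open a browser, don't collect text
            "url": auth_url,              # OAuth authorization URL
            "message": f"Authorize access to {provider} to continue.",
        },
    }

# The user completes OAuth in the browser; tokens land in the trusted
# runtime, and the tool call is retried. The model only ever sees
# "authorization required" / "authorized" -- never a credential.
```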

This upgrades MCP from “great for local demos” to “viable for multi-user, production agent systems that interact with real services.”

Arcade.dev co-authored this SEP with Anthropic and the broader MCP community, and we already support it in our runtime. If you want the detailed breakdown — why external auth was missing, why LLMs can’t participate in auth, and how URL Elicitation plugs the hole — here’s the full post: https://blog.arcade.dev/https-arcade-dev-blog-mcp-url-elicit...


One thing that keeps coming up when teams deploy MCP agents: the server passes local tests, but production falls apart. The root cause is usually client behavior, not the server.

MCP clients load all tool schemas into every model call. Those schemas dominate the context. After ~20–40 tools, the model’s tool-selection accuracy drops sharply — not due to reasoning limits, but because the action space explodes.
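A back-of-the-envelope way to see the budget problem, using the rough 4-characters-per-token heuristic (the context window size and schemas below are illustrative):

```python
import json

def schema_token_share(tool_schemas, context_window=200_000,
                       chars_per_token=4):
    """Rough estimate of how much of the context window tool
    definitions consume before the user has said anything."""
    tokens = sum(len(json.dumps(s)) for s in tool_schemas) // chars_per_token
    return tokens, tokens / context_window
```

Run this over a few dozen realistic JSON schemas and the tool definitions alone can claim a meaningful slice of every single model call.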

Sampling is supposed to fix the common “tool got partial info and needs clarification” problem by letting the tool call back into the LLM mid-execution. But most clients don’t support sampling, so the tool can’t ask follow-up questions and just fails.
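For context, an MCP sampling round-trip is the server sending a `sampling/createMessage` request back to the client mid-execution. A simplified sketch of what the tool would emit (shape per the MCP spec, reduced to the essentials):

```python
def clarification_request(question):
    """Build a sampling request so a tool can ask the host LLM a
    follow-up question mid-execution (simplified MCP shape)."""
    return {
        "method": "sampling/createMessage",
        "params": {
            "messages": [{"role": "user",
                          "content": {"type": "text", "text": question}}],
            "maxTokens": 200,
        },
    }
```

If the client doesn't implement this method, the request goes nowhere and the tool has no way to recover.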

If you’re designing for production: treat tool definitions as scarce context, split tools into smaller surfaces, and avoid any MCP client that doesn’t support sampling. Most real agent failures can be traced back to that combination.


We’ve built more tools for AI agents than anyone — Gmail, Slack, Jira, Notion, Drive, Salesforce — all running securely through Arcade.

After building hundreds of MCP servers, we noticed the same pattern: they all worked perfectly until you tried to deploy them.

OAuth tokens leaked into logs. Secrets ended up hardcoded. Multi-user context melted down as soon as a second person joined.

So we built the thing we wish we’d had from the start — the Secure MCP Framework.

It’s an open-source framework for running production-ready MCP servers with:

- Real OAuth flows (no copy-pasted examples)

- Secrets that never touch the LLM or client

- Built-in token refresh and multi-user context

- Local-first evals so you can test before deploying

Everything runs securely — tokens handled server-side, auth scoped per user, context managed automatically. The LLM never sees credentials.

Why it matters: Most “agents” look great in demos and die in production because of basic security issues. We built Secure MCP so you don’t have to rebuild your server every time your CISO finds a token in the logs.

Works anywhere: Cursor, VS Code, Claude Desktop, ChatGPT, or your own MCP client.


We’ve built more production-grade tools for AI agents than anyone — Gmail, Slack, Notion, Jira, Google Drive, Salesforce — all running securely through Arcade.

And one thing became obvious – if you’re serious about agents, you eventually need your own MCP server.

That’s why we wrote a Custom MCP Server Framework — a step-by-step guide to spin up an MCP-compliant backend that can power your own tools, services, or internal APIs.

It covers:

- Spinning up a server that speaks the Model Context Protocol (MCP)

- Exposing tools via standard endpoints (/context, /tools, /call, etc.)

- Handling authentication, state, and user context correctly

- Deploying beyond localhost — production-ready in minutes
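To make the shape of this concrete, here is a minimal sketch of the JSON-RPC core an MCP tool server dispatches on (method names and result shapes follow the MCP spec; real servers also handle `initialize`, notifications, pagination, and richer errors — the guide covers the full picture):

```python
TOOLS = [{"name": "greet",
          "description": "Say hello",
          "inputSchema": {"type": "object",
                          "properties": {"name": {"type": "string"}},
                          "required": ["name"]}}]

def handle(request):
    """Dispatch the two methods every MCP tool server needs:
    tools/list (advertise the catalog) and tools/call (execute)."""
    method = request["method"]
    params = request.get("params", {})
    if method == "tools/list":
        result = {"tools": TOOLS}
    elif method == "tools/call" and params.get("name") == "greet":
        result = {"content": [{"type": "text",
                               "text": f"Hello, {params['arguments']['name']}!"}]}
    else:
        return {"jsonrpc": "2.0", "id": request["id"],
                "error": {"code": -32601, "message": "Method not found"}}
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}
```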

Why it matters:

Most agent frameworks stop at “tool calling.” But the real challenge is the backend — managing identity, context, and persistence across thousands of tool calls.

With an MCP server, your agents can:

- Connect to any system — SaaS or internal — through standardized interfaces

- Enforce permissions and context per user

- Scale from prototype to production without rewriting core plumbing

Why we care:

At Arcade.dev, we’re pushing to make MCP the open standard for connecting agents and tools. Our platform already runs thousands of tools — this guide shows how to make it your own.

Full quickstart here → https://docs.arcade.dev/en/home/custom-mcp-server-quickstart

