It started with a session that would not reconnect.
An agent was halfway through a deployment verification when the RPC connection dropped. The task was recoverable in theory, but the session state was gone — environment variables, authentication tokens, the partial output the agent had already produced. Starting over meant losing twenty minutes of work and hoping the same failure did not happen again.
That particular failure was not an edge case. It happened weekly. And it was not the only recurring problem.
The headless browser that kept getting flagged on the first request. The CLI tool that took forty-five minutes to set up on a new machine. The pull request that passed review but broke staging because nobody checked the dependency graph. These were not one-off incidents. They were patterns.
Every team I have worked with that runs AI agents in production hits the same wall. The agents work in demos. They fall apart under real conditions — sessions drop, browsers detect automation, tools need reconfiguration after every machine change. The failures are consistent enough to be predictable, which means they are also fixable.
The toolkit started as a set of isolated patches for those specific failures. A session recovery module. A browser fingerprint patch. A CLI that packages its own runtime. Over time, those patches became reusable enough to share across projects, and the collection became something larger than the sum of its fixes.
Every artifact in this toolkit started as a fix for a real failure that happened more than once. The patterns draw from both direct shipping experience and from aggregating work across the agent infrastructure community — Andrew Ng’s research on agentic workflows, Allie Miller’s agent skills framework, browser automation patterns from the Vercel Labs team, and Anthropic’s work on tool-use architectures.
The session failure
The RPC disconnect problem was the first one I fixed, and it is the one I am asked about most. Running an agent in remote execution mode meant managing subprocesses manually, parsing raw JSON event streams, and rebuilding state every time something went wrong. The fix was a lifecycle wrapper — start, send, poll, wait, stop — that kept sessions alive across interruptions and made recovery automatic rather than manual. That wrapper eventually became the Hermes integration plugin.
The browser problem
The browser failure was subtler. Standard headless Chrome surfaces detectable flags within milliseconds of loading a page — navigator.webdriver is set, plugin arrays are empty, the permissions API behaves differently from a real user session. Most WAFs check these signals before the page content even loads. The fix was a Manifest V3 extension that strips those vectors at document_start, before any page script executes, combined with an MCP wrapper that exposes navigation, extraction, and screenshot as native agent tools. The stack is tested daily against production sites, not synthetic benchmarks.
The setup tax
The CLI tools came from a simpler frustration: every new machine or team handoff meant rebuilding the toolchain. Package versions drifted, dependencies conflicted, configuration files scattered across directories. The fix was a deliberate constraint — single-file tools with no external dependencies. karakeep for bookmarking, opencode-analyzer for cost monitoring, email-triage for inbox filtering. They work the second you download them, on any machine, with no pip install or npm ci.
The coordination gap
The SDLC pipeline came last, and it came from watching agents repeat the same mistakes. An agent would jump to implementation without a spec, build something that addressed the wrong problem, and waste a full cycle before anyone caught the mismatch. The pipeline — spec, plan, implement, test, review, simplify, ship — forces structure into the workflow. It includes a code review skill that checks for security issues, performance regressions, and spec alignment before any commit lands. A TDD workflow that enforces red-green-refactor. A shipping checklist that catches what agents forget.
Where it proves out
None of this is hypothetical. The toolkit runs in live production across a multi-agent squad, handling real tasks on a daily cadence:
- 10+ skills deployed across six specialist agent profiles
- Browser automation reaching pages that block standard headless Chrome
- CLIs running on machines that never shared a package manager configuration
- Remote sessions that survive connection drops without losing state
The artifacts ↗
development/
Seven ordered SDLC phases — spec, plan, implement, test, review, simplify, ship. Multi-axis code review and TDD workflow.
productivity/
Context files so agents operate from your actual framework instead of a blank slate. Constitution, goals, business strategy.
tools/
Single-file zero-dependency CLIs. Bookmarking with Obsidian sync, OpenRouter cost analysis, email triage.
browser/
Headless automation that handles real WAFs. agent-browser CLI plus Manifest V3 extension and MCP wrapper.
extensions/
Browser fingerprint patches at document_start. Patches navigator.webdriver, plugins, languages, Permissions API.
mcp/
MCP servers wrapping browser automation as standards-compatible tools for any MCP-compatible agent.
autonomous-agents / hermes/
Pi agent integration wrapping the full RPC lifecycle — start, send, poll, wait, stop. Session recovery built in.
references / themes
Documentation references and theme files for the agent-toolkit site.
That session that would not reconnect? It reconnects now. The deployment verification that kept getting interrupted runs to completion. The browser that got flagged loads the page. The CLI that took forty-five minutes to set up works in ten seconds.
None of these fixes are clever. They just refused to treat recurring failures as acceptable.