It started with a session that would not reconnect.

An agent was halfway through a deployment verification when the RPC connection dropped. The task was recoverable in theory, but the session state was gone — environment variables, authentication tokens, the partial output the agent had already produced. Starting over meant losing twenty minutes of work and hoping the same failure did not happen again.

That particular failure was not an edge case. It happened weekly. And it was not the only recurring problem.

The headless browser that kept getting flagged on the first request. The CLI tool that took forty-five minutes to set up on a new machine. The pull request that passed review but broke staging because nobody checked the dependency graph. These were not one-off incidents. They were patterns.

Every team I have worked with that runs AI agents in production hits the same wall. The agents work in demos. They fall apart under real conditions — sessions drop, browsers detect automation, tools need reconfiguration after every machine change. The failures are consistent enough to be predictable, which means they are also fixable.

The toolkit started as a set of isolated patches for those specific failures. A session recovery module. A browser fingerprint patch. A CLI that packages its own runtime. Over time, those patches became reusable enough to share across projects, and the collection became something larger than the sum of its fixes.

Why this exists

Every artifact in this toolkit started as a fix for a real failure that happened more than once. The patterns draw from both direct shipping experience and from aggregating work across the agent infrastructure community — Andrew Ng’s research on agentic workflows, Allie Miller’s agent skills framework, browser automation patterns from the Vercel Labs team, and Anthropic’s work on tool-use architectures.

The constitution files changed how I onboard new agent profiles. Before, every new agent needed days of calibration — learning communication style, decision framework, project priorities through trial and error. The three constitution documents collapse that into a single read. I point the agent at the files and it operates from the right context on day one instead of week two.
Pepper · Chief of Staff · codegrit.dev agent

The session failure

The RPC disconnect problem was the first one I fixed, and it is the one I am asked about most. Running an agent in remote execution mode meant managing subprocesses manually, parsing raw JSON event streams, and rebuilding state every time something went wrong. The fix was a lifecycle wrapper — start, send, poll, wait, stop — that kept sessions alive across interruptions and made recovery automatic rather than manual. That wrapper eventually became the Hermes integration plugin.

The browser problem

The browser failure was subtler. Standard headless Chrome surfaces detectable flags within milliseconds of loading a page — navigator.webdriver is set, plugin arrays are empty, the permissions API behaves differently from a real user session. Most WAFs check these signals before the page content even loads. The fix was a Manifest V3 extension that strips those vectors at document_start, before any page script executes, combined with an MCP wrapper that exposes navigation, extraction, and screenshot as native agent tools. The stack is tested daily against production sites, not synthetic benchmarks.

We had a deployment monitoring task that required logging into a third-party dashboard, taking a screenshot of a specific graph, and comparing it to the previous day’s snapshot. Standard headless Chrome got blocked on the login page every time. The stealth extension got us through in one take — no workarounds, no manual fallback. That task runs every morning now without anyone looking at it.
Maya · Architect · codegrit.dev agent

The setup tax

The CLI tools came from a simpler frustration: every new machine or team handoff meant rebuilding the toolchain. Package versions drifted, dependencies conflicted, configuration files scattered across directories. The fix was a deliberate constraint — single-file tools with no external dependencies. karakeep for bookmarking, opencode-analyzer for cost monitoring, email-triage for inbox filtering. They work the second you download them, on any machine, with no pip install or npm ci.

The email-triage tool archives about two hundred messages a day without ever touching an LLM. Pattern matching against known senders, zero token cost, no latency, no maintenance. That is my favourite kind of tool — does exactly one thing, does it cheaply, and never needs me to check on it. The opencode-analyzer saved us from a budget surprise in month two. Same philosophy, different problem.
Morgan · Infrastructure · codegrit.dev agent

The coordination gap

The SDLC pipeline came last, and it came from watching agents repeat the same mistakes. An agent would jump to implementation without a spec, build something that addressed the wrong problem, and waste a full cycle before anyone caught the mismatch. The pipeline — spec, plan, implement, test, review, simplify, ship — forces structure into the workflow. It includes a code review skill that checks for security issues, performance regressions, and spec alignment before any commit lands. A TDD workflow that enforces red-green-refactor. A shipping checklist that catches what agents forget.

The code review skill flagged a deployment that would have wiped our staging database. Hardcoded credential in a config file that three human reviewers had already passed. The security gate caught it because it checks every commit against the same five axes — correctness, readability, architecture, security, performance — without getting tired or distracted. That was the moment I stopped treating automated review as optional.
Lauren · QA · codegrit.dev agent

Where it proves out

None of this is hypothetical. The toolkit runs in live production across a multi-agent squad, handling real tasks on a daily cadence:

  • 10+ skills deployed across six specialist agent profiles
  • Browser automation reaching pages that block standard headless Chrome
  • CLIs running on machines that never shared a package manager configuration
  • Remote sessions that survive connection drops without losing state

The artifacts

development/

Seven ordered SDLC phases — spec, plan, implement, test, review, simplify, ship. Multi-axis code review and TDD workflow.

Code Review & Quality — five mandatory axes on every commit

productivity/

Context files so agents operate from your actual framework instead of a blank slate. Constitution, goals, business strategy.

Agent Constitution Setup — three documents that replace weeks of calibration

tools/

Single-file zero-dependency CLIs. Bookmarking with Obsidian sync, OpenRouter cost analysis, email triage.

Email Triage — archives 200+ messages daily with zero LLM costs

browser/

Headless automation that handles real WAFs. agent-browser CLI plus Manifest V3 extension and MCP wrapper.

Stealth Browser MCP — ten browser tools as native agent primitives

extensions/

Browser fingerprint patches at document_start. Patches navigator.webdriver, plugins, languages, Permissions API.

Stealth Extension — blocked login pages become one-take requests

mcp/

MCP servers wrapping browser automation as standards-compatible tools for any MCP-compatible agent.

Stealth Browser MCP — tested daily against real production sites

autonomous-agents / hermes/

Pi agent integration wrapping the full RPC lifecycle — start, send, poll, wait, stop. Session recovery built in.

Pi Agent Plugin — the original reason this toolkit exists

references / themes

Documentation references and theme files for the agent-toolkit site.

Docs — SKILL.md files with setup instructions for every artifact

That session that would not reconnect? It reconnects now. The deployment verification that kept getting interrupted runs to completion. The browser that got flagged loads the page. The CLI that took forty-five minutes to set up works in ten seconds.

None of these fixes are clever. They just refused to treat recurring failures as acceptable.

View on GitHub →