The Problem: Solo Founder, Infinite Surface Area
I run six revenue products simultaneously. Two mobile apps in TestFlight and Google Play closed testing. A hypnotherapy content platform with 22 AI-generated sessions and a daily publishing pipeline. An AI consulting website with 42 blog posts. A model comparison marketplace with 35 products. A personality assessment framework with 7 psychological models. All live. All generating content. All requiring engineering, marketing, testing, and operational attention.
I have zero employees.
Claude Code is not a tool I use. It is the engineering team. The product team. The marketing team. The QA team. And this is not a cute metaphor — I mean it literally. When I say "the frontend engineer is implementing the payment flow," I am referring to an autonomous Claude agent with a markdown specification file that defines its role, authority boundaries, file ownership, and escalation protocols.
But here is the thing about giving an AI system this much operational scope: unmanaged agents are chaos. They hallucinate requirements. They drift from specifications. They confidently implement the wrong thing with perfect syntax. They modify files they should not touch. They burn API tokens running in circles.
The solution is not to throttle them. It is to govern them. And the governance mechanism is not a platform, not a SaaS product, not a Kubernetes cluster. It is a directory of JSON files, a set of markdown specifications, and 42 custom skills that compose into something I can only describe as an agent operating system.
The Skill Layer: 42 Purpose-Built Tools
Claude Code has a feature called skills — markdown files in ~/.claude/skills/ that extend what Claude can do in any project. Each skill is a structured instruction set with a name, description, and step-by-step protocol. Claude auto-routes to the relevant skill based on what you ask for. The description field does semantic matching, so if your skill is well-described, Claude will invoke it when the context fits without you needing to remember the exact command.
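For concreteness, here is what a minimal skill file might look like. This is a hypothetical example, not one of my 42; the YAML frontmatter fields (`name`, `description`) are the part Claude's semantic routing reads, and the body is the protocol it follows once invoked:

```markdown
---
name: deploy-checklist
description: Run the pre-deploy checklist for any Netlify project. Use when the user asks to deploy, ship, or push a site live.
---

# Deploy Checklist

1. Run the build locally and confirm zero errors.
2. Check git status for uncommitted changes.
3. Verify environment variables against .env.example.
4. Deploy, then smoke-test the production URL.
```

Because the description names both the task and the trigger phrases, Claude routes to it when I say "ship the site" without my remembering any command.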
This sounds simple. It is deceptively powerful.
I have 42 of them. They range from single-purpose utilities to sophisticated multi-model orchestration protocols. Here are four that illustrate the pattern.
ai-orchestrator: Multi-Model Routing
This skill turns Claude into a dispatch layer across every AI service I use. Instead of remembering which API to call for image generation versus TTS versus video, I describe what I want and the orchestrator routes to the optimal tool with automatic fallback chains.
The routing decision tree covers six services:
| Task      | Primary Tool     | Fallback 1         | Fallback 2    |
|-----------|------------------|--------------------|---------------|
| Image gen | Imagen 4         | Gemini native      | GPT-Image-1.5 |
| Video gen | HeyGen / Veo 3.1 | Veo browser        | —             |
| TTS       | ElevenLabs MCP   | ElevenLabs REST    | —             |
| Music/SFX | ElevenLabs MCP   | —                  | —             |
| Research  | NotebookLM       | Claude + WebSearch | Gemini        |
| Text gen  | Claude (direct)  | Gemini             | GPT-4o        |
The skill includes full curl templates with auth headers, response parsing instructions, cost reference tables, and a hard rule: any single operation over $1 requires confirmation. It loads credentials from a shared .env file on startup and validates every key before attempting API calls. When an API fails, it follows explicit fallback chains — 401s trigger credential verification, 429s wait and retry once before switching to the fallback tool, 500s retry then fail over.
The critical design choice: the orchestrator does not abstract away the underlying services. It provides a routing layer with transparent decision-making. I can always force a specific tool with --tool openai or run all three models in parallel for comparison. The routing is a convenience, not a cage.
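The per-status-code fallback behavior can be sketched as a small router. This is a reconstruction under assumptions, not the skill's actual code; the chain contents and the injected `call` function are illustrative:

```python
import time

# Illustrative chains; the real skill covers six services (see table above).
FALLBACK_CHAINS = {
    "image_gen": ["imagen-4", "gemini-native", "gpt-image-1.5"],
    "tts": ["elevenlabs-mcp", "elevenlabs-rest"],
}

def route(task, call, backoff_seconds=30):
    """Try each tool in the chain, applying the per-status-code rules:
    401 -> skip to next tool, 429 -> wait and retry once, 500 -> retry once."""
    last_error = None
    for tool in FALLBACK_CHAINS[task]:
        for attempt in range(2):  # at most one retry per tool
            status, result = call(tool)
            if status == 200:
                return tool, result
            if status == 401:  # bad credentials: no point retrying this tool
                last_error = f"{tool}: auth failed"
                break
            if status == 429 and attempt == 0:  # rate limit: wait, retry once
                time.sleep(backoff_seconds)
                continue
            if status == 500 and attempt == 0:  # server error: retry once
                continue
            last_error = f"{tool}: status {status}"
            break
    raise RuntimeError(f"all tools failed for {task}: {last_error}")
```

The chain is data, not logic, which is what makes forcing a specific tool or reordering fallbacks a one-line change.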
portfolio-pm: Parallel Health Checks Across 6 Projects
This is the skill I run most often. Invoking /pm spawns six parallel sub-agents, each scanning a different project for its current state:
| # | Project            | Deploy  | Revenue Model             | Content       |
|---|--------------------|---------|---------------------------|---------------|
| 1 | modelstack.digital | Netlify | Stripe Payment Links      | 19 blog posts |
| 2 | aphor.me           | Netlify | Stripe Checkout $9.99-$29 | 22 sessions   |
| 3 | mvat-focus         | EAS     | IAP $4.99/mo              | Pomodoro app  |
| 4 | agor.me            | Netlify | Consulting bookings       | 42 blog posts |
| 5 | mvat-mirror        | EAS     | IAP $9.99/mo              | 7 frameworks  |
| 6 | mvat.ai            | Netlify | Brand support             | 39 agents     |
Each sub-agent checks git status, last five commits, deploy status, content counts, and identifies the single biggest blocker preventing revenue. The results merge into a unified dashboard with a revenue priority score calculated from product readiness, payment integration, marketing maturity, and content volume. It then generates a ranked list of the five highest-impact actions across the entire portfolio.
The sprint mode (/pm sprint) generates exactly 10 weekly items allocated by revenue rank: 4 items for the highest-priority project, 3 for second, 2 for third, 1 spread across the rest. Each item specifies the project, action, revenue impact, effort estimate, files to modify, and dependencies. No vague goals. Every item is scoped to be completable in a single Claude Code session.
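The 4/3/2/1 allocation is simple enough to state as code. A minimal sketch, assuming projects arrive already sorted by revenue priority (the function and field names are mine, not the skill's):

```python
def allocate_sprint(projects_by_rank):
    """Split exactly 10 weekly items by revenue rank: 4/3/2 for the
    top three projects, plus 1 item spread across the rest."""
    quotas = [4, 3, 2]  # fixed quotas for the top three
    allocation = {}
    for project, quota in zip(projects_by_rank, quotas):
        allocation[project] = quota
    if len(projects_by_rank) > 3:
        allocation["rest"] = 1  # the single remaining item
    return allocation
```

The point of hardcoding the split is that it removes a weekly negotiation: the ranking decides, not my mood.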
anamnesis-protocol: Relational Memory Architecture
This is the most philosophically interesting skill I have built. Named after the Greek concept of anamnesis — Plato's idea that learning is really remembering — it transforms Claude's flat memory system into a six-layer identity architecture.
The layers implement specific cognitive science principles:
| Layer | Slots | Implements                          | Purpose                           |
|-------|-------|-------------------------------------|-----------------------------------|
| 1     | 1–3   | Bruner's narrative coherence        | Compressed relationship arc       |
| 2     | 4–6   | Klein & Nichols' co-emergent self   | Who Claude is in THIS partnership |
| 3     | 7–11  | Damasio's somatic markers           | Emotionally significant anchors   |
| 4     | 12–16 | Rathbone's identity clustering      | "I am" statements and transitions |
| 5     | 17–22 | Conway's "general events"           | Active projects and themes        |
| 6     | 23–30 | Conway's "event-specific knowledge" | Technical prefs and corrections   |
The protocol specifies that Claude reads Layers 1-3 silently at conversation start — a two-second orientation that prevents the "cold start" problem where the AI acts like a stranger. Layer 3 entries carry emotional valence tags that bias how Claude approaches related topics. The goal is invisible continuity: I should feel like talking to someone who remembers, not someone performing remembering.
This is not sentimentality. It is operational architecture. When Claude remembers that my consulting practice is identity-central and not just another project, it allocates attention differently. When it knows that dense analytical prose lands better than bullet-point summaries, it skips three rounds of format negotiation. The compound effect across hundreds of conversations is enormous.
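The slot-to-layer mapping and the silent startup read can be modeled in a few lines. This is my illustrative reconstruction of the table above, not the protocol's actual implementation; the dataclass fields are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

# Slot ranges taken directly from the layer table above.
LAYER_SLOTS = {1: range(1, 4), 2: range(4, 7), 3: range(7, 12),
               4: range(12, 17), 5: range(17, 23), 6: range(23, 31)}

@dataclass
class MemorySlot:
    slot: int
    text: str
    valence: Optional[str] = None  # Layer 3 entries carry emotional valence tags

def layer_of(slot_number):
    for layer, slots in LAYER_SLOTS.items():
        if slot_number in slots:
            return layer
    raise ValueError(f"slot {slot_number} out of range")

def startup_read(slots):
    """Layers 1-3 are read silently at conversation start."""
    return [s for s in slots if layer_of(s.slot) <= 3]
```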
claudeception: The Learning System That Creates New Skills
This is the meta-skill — the one that creates the others. Claudeception is a continuous learning system that extracts reusable knowledge from work sessions and codifies it into new Claude Code skills.
It triggers in four ways: the explicit /claudeception command for session review, "save this as a skill" for targeted extraction, "what did we learn?" for retrospective analysis, and automatically after any task involving non-obvious debugging or trial-and-error discovery.
The extraction protocol has quality gates: the knowledge must be reusable (not just this one instance), non-trivial (requires discovery, not documentation lookup), specific (exact trigger conditions and solution), and verified (actually worked, not theoretically). Before creating a new skill, it searches existing skills to decide whether to update, extend, or create fresh. Versioning follows semver: patch for typos, minor for new scenarios, major for breaking changes.
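The four gates can be expressed as a single predicate. The gate names come from the text; the candidate field names are my illustrative stand-ins, not claudeception's actual schema:

```python
def passes_quality_gates(candidate):
    """Return (passed, failed_gate_names) for a candidate skill extraction."""
    gates = {
        "reusable":   candidate.get("applies_beyond_this_instance", False),
        "nontrivial": candidate.get("required_discovery", False),
        "specific":   bool(candidate.get("trigger_conditions"))
                      and bool(candidate.get("solution")),
        "verified":   candidate.get("actually_worked", False),
    }
    failed = [name for name, ok in gates.items() if not ok]
    return (len(failed) == 0, failed)
```

Anything that fails a gate stays in the session notes instead of becoming a skill, which is what keeps the skill library from filling with noise.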
This is how 42 skills happened without a master plan. Each emerged from real work sessions. The system literally gets smarter over time because every hard-won insight has a mechanism to become a permanent capability.
39 Agents, Zero Infrastructure
The skill layer gives Claude capabilities. The agent layer gives those capabilities structure, boundaries, and governance.
I run 39 agents organized into 8 departments: Product (5), Design (5), Engineering (8), Testing (5), Marketing (5), Analytics (5), Finance (4), and Governance (2). Each agent has a markdown specification file that defines its role, authority boundaries, input/output artifact types, file ownership, and escalation protocols.
The model assignment follows a strict tier system optimized for cost and capability:
```
Opus (claude-opus-4-6) — 7 agents (production code + critical gates):
  architect, frontend-engineer, backend-engineer, code-reviewer,
  quality-sentinel, pipeline-judge, spec-evolver

Sonnet (claude-sonnet-4-6) — 19 agents (content + analysis):
  product-strategist, spec-writer, ux-researcher, ui-designer,
  test-strategist, unit-test-writer, content-writer, aso-optimizer,
  security-engineer, devops-engineer, and 9 more

Haiku (claude-haiku-4-5) — 13 agents (read-only monitoring):
  market-researcher, anomaly-detector, crash-reporter, budget-manager,
  spend-alerter, revenue-tracker, and 7 more
```
The tier rules are absolute: Haiku agents must not write user-facing content, code, or make gating decisions. Any agent that writes content must be Sonnet or higher. Any agent that writes production code must be Opus. This is not about quality gatekeeping — it is about cost control being a first-class architectural concern. Running 13 monitoring agents on Haiku costs a fraction of what even one Opus agent costs. The tier system works better than I expected because it forces you to decompose work by cognitive complexity, which turns out to be a cleaner separation of concerns than decomposing by domain.
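Because the rules are absolute, they can be checked mechanically. A sketch of the tier rules as a lint pass over agent specs; the capability flags are hypothetical field names, but the three rules are exactly the ones stated above:

```python
TIER_RANK = {"haiku": 0, "sonnet": 1, "opus": 2}

def check_tier(agent):
    """Return a list of tier-rule violations for one agent spec dict."""
    violations = []
    tier = TIER_RANK[agent["model_tier"]]
    if agent.get("writes_production_code") and tier < TIER_RANK["opus"]:
        violations.append("production code requires Opus")
    if agent.get("writes_content") and tier < TIER_RANK["sonnet"]:
        violations.append("user-facing content requires Sonnet or higher")
    if agent.get("makes_gating_decisions") and tier == TIER_RANK["haiku"]:
        violations.append("Haiku agents cannot make gating decisions")
    return violations
```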
Governance-as-Code
Every governance mechanism is a JSON file in the governance/ directory. No database. No admin panel. No dashboard. Files. Version-controlled, diffable, auditable files.
The kill switch is the most important one:
```json
{
  "global_enabled": true,
  "departments": {
    "engineering": { "enabled": true, "agent_count": 8 },
    "governance": { "enabled": true, "agent_count": 2 }
  },
  "agents": {
    "frontend-engineer": {
      "enabled": true,
      "department": "engineering",
      "rollout_phase": "R2",
      "reason": "R2 active"
    },
    "spec-evolver": {
      "enabled": true,
      "department": "governance",
      "reason": "R1 active — autonomous spec revision agent"
    }
  }
}
```
Set global_enabled to false and everything stops. Set a department to disabled and that entire group halts. Set an individual agent to disabled and only that one goes dark. The rollout_phase field controls phased activation — agents are introduced in waves (R1 through R4), not all at once.
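The resolution order matters: an agent runs only if the global flag, its department, and its own entry are all enabled. A minimal sketch of that resolution, assuming the file layout shown above (the function name is mine):

```python
def agent_enabled(config, agent_name):
    """Kill-switch resolution: global AND department AND agent must all be on."""
    if not config.get("global_enabled", False):
        return False
    agent = config["agents"].get(agent_name)
    if agent is None or not agent.get("enabled", False):
        return False
    dept = config["departments"].get(agent["department"], {})
    return dept.get("enabled", False)
```

The default-to-disabled behavior is deliberate: an agent missing from the file, or a malformed entry, fails closed rather than open.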
Circuit breakers prevent runaway failures:
```json
{
  "global": {
    "max_simultaneous_trips": 5,
    "pipeline_paused": false
  },
  "defaults": {
    "max_consecutive_failures": 3,
    "auto_reset_minutes": 30
  },
  "agents": {
    "frontend-engineer": {
      "consecutive_failures": 0,
      "tripped": false,
      "tripped_at": null,
      "last_failure_reason": null,
      "auto_reset_minutes": 30,
      "max_consecutive_failures": 3
    }
  }
}
```
Three consecutive failures and the agent trips. It auto-resets after 30 minutes, but if five or more agents trip simultaneously, the entire pipeline pauses. This is the anti-drift rule that prevents cascading failures from propagating through the system. The max-3-iterations rule on Executor/Validator/Critic loops adds another layer: no agent can spin in a retry loop indefinitely.
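The breaker semantics can be sketched against the state shape in the file above. This is an illustrative reconstruction, not the hooks' actual code:

```python
from datetime import datetime, timedelta, timezone

def record_failure(state, reason, now=None):
    """Increment the failure count; trip the breaker at the threshold."""
    now = now or datetime.now(timezone.utc)
    state["consecutive_failures"] += 1
    state["last_failure_reason"] = reason
    if state["consecutive_failures"] >= state["max_consecutive_failures"]:
        state["tripped"] = True
        state["tripped_at"] = now
    return state

def is_open(state, now=None):
    """True if the breaker currently blocks the agent; auto-resets after the window."""
    now = now or datetime.now(timezone.utc)
    if not state["tripped"]:
        return False
    reset_at = state["tripped_at"] + timedelta(minutes=state["auto_reset_minutes"])
    if now >= reset_at:  # auto-reset window elapsed
        state.update(tripped=False, tripped_at=None, consecutive_failures=0)
        return False
    return True

def pipeline_paused(agent_states, max_simultaneous_trips=5):
    """Pause everything when too many breakers trip at once."""
    return sum(1 for s in agent_states if s["tripped"]) >= max_simultaneous_trips
```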
Autonomy levels define what actions agents can and cannot take without human approval:
```json
{
  "current_level": "L4",
  "levels": {
    "L1": { "name": "Full Human Control" },
    "L2": { "name": "Human-in-the-Loop" },
    "L3": { "name": "Human-on-the-Loop" },
    "L4": { "name": "Full Autonomy with Guardrails" },
    "L5": { "name": "Full Autonomy" }
  },
  "decision_matrix": {
    "create_artifact": { "min_confidence": 0.85, "action": "auto_execute" },
    "install_dependency": { "min_confidence": 0.85, "action": "flag_for_review" },
    "deploy_preview": { "min_confidence": 0.85, "action": "escalate" },
    "deploy_production": { "min_confidence": 0.0, "action": "never" },
    "modify_governance": { "min_confidence": 0.0, "action": "never" },
    "publish_to_store": { "min_confidence": 0.0, "action": "never" },
    "modify_billing": { "min_confidence": 0.0, "action": "never" }
  }
}
```
The system runs at L4 — full autonomy with guardrails. Agents auto-execute routine work above 0.85 confidence. Below 0.65, they escalate. Production deploys, app store submissions, billing changes, and governance modifications are hardcoded to "never" — no confidence threshold overrides them. This is not a slider you adjust. These are invariants.
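The lookup itself is tiny, which is the point: "never" actions ignore confidence entirely, and everything else compares against the per-action threshold. A sketch under the matrix shape shown above (the 0.65 escalation floor is from the text; the function name is mine):

```python
def decide(matrix, action, confidence, escalate_below=0.65):
    """Map (action, confidence) to an outcome using the decision matrix."""
    rule = matrix[action]
    if rule["action"] == "never":
        return "blocked"  # hardcoded invariant: no confidence overrides it
    if confidence < escalate_below:
        return "escalate"  # below the floor, always ask a human
    if confidence >= rule["min_confidence"]:
        return rule["action"]
    return "escalate"  # in the gray zone between floor and threshold
```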
The mutual oversight constraint is architecturally critical: spec-evolver and pipeline-judge cannot modify each other's specifications. Only the founder can edit those two files. This prevents the governance layer from being self-modifying in ways that remove its own constraints — the AI equivalent of making sure the auditor cannot audit their own books.
The Learning Loop
The governance layer prevents bad outcomes. The learning layer produces good ones.
The primary learning signal is corrections.jsonl — an append-only log of every time I correct the system:
```json
{"timestamp":"2026-03-06T04:38:12Z",
 "corrected_artifact_id":"framework-separation",
 "correction_type":"config_change",
 "rationale":"Separated framework and product repos, parameterized agent specs with $PRODUCT_DIR",
 "affected_agents":["architect","frontend-engineer","quality-sentinel","pipeline-judge","spec-evolver"]}
{"timestamp":"2026-03-06T04:48:40Z",
 "corrected_artifact_id":"test-strategist-product-dir",
 "correction_type":"spec_edit",
 "rationale":"Fixed 5 remaining hardcoded app/ paths in test-strategist.md",
 "affected_agents":["test-strategist"]}
```
Every correction logs what changed, why, and which agents are affected. This is not just an audit trail — it is the training data for spec-evolver, the 39th agent, whose sole job is to autonomously revise agent specifications based on accumulated evidence.
Spec-evolver reads corrections, identifies patterns (the same type of error recurring across agents), and proposes specification changes. It has conditional write access to all agent specs except its own and pipeline-judge's. Its proposals are structured with evidence chains: which corrections triggered the revision, what the old spec said, what the new spec says, and why the change should prevent the observed failure class.
Agent memory operates in four layers: facts (verified truths about the codebase), beliefs (hypotheses about patterns), evidence (data supporting or contradicting beliefs), and revisions (history of what changed and why). This mirrors how institutional knowledge actually works — you start with observations, form hypotheses, collect evidence, and revise your understanding. The system tracks this lifecycle explicitly rather than treating knowledge as a flat key-value store.
The compound effect is real. Corrections made in week one propagate through spec-evolver into specification changes that prevent the same class of error in week three. The system does not just fix individual mistakes. It revises the upstream conditions that produced them.
What I'd Build Differently
Honest assessment time.
File-based governance works better than it should. The fact that the entire governance layer is JSON files in a git repository means I get version control, diff-ability, and rollback for free. When I adjust autonomy levels or trip a circuit breaker, it is a file edit that shows up in git log. I can see the full history of every governance decision I have ever made. No dashboard gives you that.
But file-based governance has real limits. There is no real-time enforcement — the hooks check files before tool calls, but there is no daemon watching for drift between checks. There is no cross-machine coordination — if two agents run on different machines (which does not happen today, but could), the governance files are not synchronized. And there is no alerting — when a circuit breaker trips, I find out the next time I look at the file, not via a push notification.
The tier system is the best cost decision I made. Putting 13 agents on Haiku was initially a compromise. It turned out to be a design insight. Monitoring, reporting, and anomaly detection are read-heavy workloads that do not require frontier reasoning. Running them on the cheapest model that can follow structured instructions means the cost of maintaining 13 always-available monitoring agents is negligible. If I had put everything on Opus, the API costs would have made the whole system uneconomical.
The artifact protocol is overengineered. Every inter-agent communication goes through structured artifacts with JSON headers including artifact_id, author, status, confidence, success_criteria, and provenance chains. For the core pipeline (Product to Design to Engineering to Testing), this is exactly right — the structure prevents hallucinated requirements from propagating. For lightweight monitoring agents that just need to report a metric, the overhead is unnecessary. I would introduce lightweight message types for simple signals alongside the full artifact protocol for cross-department handoffs.
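To make the contrast concrete: the full header fields are the ones named above, while the values shown and the lightweight shape are hypothetical, since the lightweight type does not exist yet.

```python
# Full artifact header: every cross-department handoff carries all of this.
full_artifact = {
    "artifact_id": "eng-2026-0312",       # illustrative value
    "author": "frontend-engineer",
    "status": "draft",
    "confidence": 0.91,
    "success_criteria": ["checkout renders", "payment link resolves"],
    "provenance": ["spec-2026-0301", "design-2026-0305"],
}

# Proposed lightweight signal: enough for a monitoring agent's report.
lightweight_signal = {
    "from": "revenue-tracker",
    "metric": "mrr_usd",
    "value": 412.0,
}
```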
The 10-stage pipeline is correctly scoped but underutilized. The pipeline runs Discovery → Strategy → Design → Engineering → Code Review → Testing → Build/Deploy → Marketing Prep → Release/Monitor → Feedback Loop, with pipeline-judge validating at every stage transition. In practice, I use stages 4-7 (Engineering through Deploy) constantly and stages 1-3 and 8-10 sporadically. The architecture is right, but the workflow has not caught up yet. The feedback loop from Stage 10 back to Stage 1 — where pipeline-judge produces a cross-department synthesis report — is the most valuable stage and the one I use least.
What surprised me most: the highest-leverage artifacts in the entire system are not code files. They are agent specifications. A one-paragraph change to an agent's Authority Boundaries section changes its behavior across every future invocation. No refactoring, no deployment, no migration. Just a markdown edit. The agent specification is the highest-leverage surface area in the entire architecture, and I spent more time refining specifications than writing application code.
The Takeaway
You do not need Kubernetes to run an agent army. You need files, hooks, and discipline.
The entire system I have described — 39 agents, 8 departments, governance-as-code, circuit breakers, autonomy levels, a learning loop, 42 skills — runs on a single laptop with zero infrastructure beyond Claude Code and a file system. There is no server. There is no database. There is no message queue. There are markdown files that define roles, JSON files that define rules, and hooks that enforce both.
The insight that took me months to internalize: the highest-leverage artifact is the agent specification, not the code it produces. A well-specified agent with clear authority boundaries, explicit file ownership, and structured escalation protocols will produce good code across hundreds of invocations. A poorly specified agent will produce inconsistent results no matter how good the underlying model is. The specification is the product. The code is a byproduct.
If you read The Markdown-Defined Company, this is the case study. If you read The Memory Moat, the learning loop is the implementation. This post is the technical details behind the theory. The trilogy is: why this architecture is structurally inevitable, why institutional memory is the competitive moat, and now — how it actually works in production.
The 42 skills are not 42 scripts. They are 42 codified lessons from hundreds of work sessions, each one representing a problem solved hard once so it never has to be solved again. The kill switch is not paranoia. It is the engineering discipline that makes autonomy possible — you can only give agents freedom proportional to the governance constraining them.
If you are building agent systems and want to compare notes — or if you are hiring for this kind of work — reach out.