I've been working on a side project — a web app built on React 19, TypeScript, MUI on the frontend and Node.js/Express with Firebase/Firestore on the backend, deployed on Cloud Run. The project is at a stage where there's real code, real data, and real users, which makes me much more careful about what I let an AI assistant touch without oversight.
GitHub Copilot (and AI coding assistants in general) are genuinely useful, but out of the box they have no knowledge of your project's conventions, your architecture boundaries, or which quality concerns actually matter. Ask it to "fix the login page" and it might improve the UI while quietly bypassing an auth check. Ask it to "add a new field to the database" and it won't know whether existing records need a migration.
What I wanted was something closer to a team of specialists — each one with deep knowledge of their domain — that I could invoke by name, with consistent behaviour across every session. Here's how I set that up.
The structure: agents and skills
Everything lives in .github/ so GitHub Copilot picks it up automatically. The top-level split is between agents and skills.
Skills are reusable knowledge documents. They contain the kind of context that would normally live in a developer's head — things like the project's architecture, the OWASP Top 10 checklist tailored to this specific stack, the WCAG 2.2 audit criteria for MUI components, the Firestore schema and migration framework, and so on. Skills are not invoked directly; they're loaded by agents when relevant.
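Roughly, the layout looks like this (the agents/ and skills/ directory names and most file names here are illustrative, sketched from the pieces described in this post rather than copied from the repo):

```text
.github/
  agents/
    planner.agent.md
    security.agent.md
    a11y.agent.md
    ...                     # one .agent.md per specialist
  skills/
    code-review/SKILL.md
    owasp-audit/SKILL.md
    wcag-audit/SKILL.md
    browser-testing/SKILL.md
    ...
  workflows/
    copilot-setup-steps.yml
    ci.yml                  # name illustrative
```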
Agents are the things you actually talk to. Each agent is a specialist, defined in a .agent.md file with a description (which Copilot uses for routing) and a prompt that tells it exactly what it is, what to load, and how to behave.
The agents I have set up:
- Planner — doesn't write any code; breaks down features into structured, sequenced task lists and assigns them to the right specialists
- UI/UX — React + MUI frontend work, strictly within the design system
- A11y — WCAG 2.2 accessibility audits and fixes
- Security — OWASP Top 10 audit, with exploit scenarios for every finding
- i18n — translation quality, missing keys, adding new languages
- Performance — Core Web Vitals, React re-render patterns, backend latency
- Code — holistic PR-style review covering correctness, types, error handling, quality
- Database — detects whether a Firestore schema change needs a migration and writes it
- Dependencies — npm audit and upgrade planning across three workspaces
- Copilot — audits and improves the Copilot setup itself
Each agent declares which skills it loads. The Security agent, for example, loads code-review (for the output format and severity scale) and owasp-audit (for the OWASP checklist tailored to this project's attack surfaces). The A11y agent loads code-review and wcag-audit. The UI/UX agent reads the project's design system master document before touching any UI code. By composing skills into agents this way, a lot of the review framework — severity scale, output format, confirm-before-fix rules — is defined once in code-review/SKILL.md and inherited everywhere.
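As a sketch of the shape this takes, the Security agent's file looks something like the following; the front-matter fields and skill paths are simplified assumptions rather than the real file:

```markdown
---
name: Security
description: OWASP Top 10 security audits and fixes for this project's attack surfaces.
---

You are the Security specialist for this repository.

Before doing anything, load:
- skills/code-review/SKILL.md   (severity scale, output format, confirm-before-fix rules)
- skills/owasp-audit/SKILL.md   (OWASP Top 10 checklist tailored to this stack)

Default to Review mode: report each finding with a concrete exploit scenario,
and do not edit code until the user has confirmed a plan.
```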
Three operating modes
Every agent (except the Planner, which is always planning) supports three modes:
- Plan — produces a structured implementation plan and waits for confirmation before doing anything
- Implement — executes a confirmed plan, with typecheck and lint validation after every change
- Review — audits and surfaces findings only, no edits made
The Plan mode output is structured: a task breakdown with which agent handles each task, what files are touched, what the prerequisites are, and what validation to run after each step. For bigger features this doubles as a way to sequence multi-agent work — the Planner produces the plan, and you hand off each task group to the appropriate specialist.
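A single task in that output looks something like this; the task itself, the file paths, and the exact validation line are invented for illustration:

```markdown
### Task 2: Server-side validation for the new profile field
- Agent: Security
- Files: backend/src/routes/profile.ts, backend/src/validation/profile.ts
- Prerequisites: Task 1 (schema change and migration) complete
- Validation: backend typecheck and lint from the repo root
```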
Quality gates baked in
One thing I was deliberate about: every agent is required to run validation before declaring a task complete. For frontend work that's npm run typecheck:frontend && npm run lint:frontend from the repo root. For backend it's the equivalent backend commands. This is explicitly stated in the skill files and repeated in the agents that invoke them.
The agents also have a firm confirm-before-fix rule: don't apply edits until the user confirms, unless all three conditions are met — the finding is Critical, the fix is trivially safe, and no runtime behaviour changes. This means the agents are auditors by default and surgeons only when explicitly asked.
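Both rules live in the shared review skill, so every agent inherits them. Condensed, the relevant part of code-review/SKILL.md reads along these lines (wording paraphrased, not verbatim):

```markdown
## Completion gate
A task is not complete until validation passes from the repo root:
- Frontend: npm run typecheck:frontend && npm run lint:frontend
- Backend: the equivalent backend typecheck and lint commands

## Confirm before fix
Surface findings first. Apply an edit without confirmation only if all three hold:
1. the finding is Critical,
2. the fix is trivially safe,
3. no runtime behaviour changes.
```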
For browser verification I set up a browser-testing skill that agents load when they need to confirm something rendered correctly. It uses MCP servers for Chrome DevTools, the Firebase emulator, and backend logs. The skill describes a full 11-step verification workflow: discover running MCP servers, find the correct port, set up test data in Firestore, authenticate as the right user, interact, screenshot, and correlate browser errors with backend logs. The key constraint in the skill: never start, spawn, or configure MCP servers — only use what's already running. Without that rule agents tend to try to spin up things themselves and make a mess.
Feeding GitHub PR reviews back in
GitHub has automated code review on pull requests. I pipe those findings back into the agents as a second opinion. The Code agent's review format is intentionally close to GitHub Copilot's PR reviewer style — inline, per-file comments grouped by file with before/after suggested changes — so the output from a manual invocation and the automated PR review are in a consistent format.
After a PR review runs, I can link the findings into a Copilot session and say "plan the fixes for these findings" — the Planner agent picks that up, groups related findings from the same file, sequences them, and assigns each task to the right specialist. The Security agent then fixes the security findings, the A11y agent handles accessibility, and so on.
Git worktrees for isolation
For larger features I use git worktrees so Copilot can work in an isolated branch with its own dev server and Firestore emulator running in parallel with main. Each worktree gets its own port range (backend, auth, Firestore, Firebase UI all offset by the worktree slot number) so nothing collides. There are two shell scripts — worktree-create.sh and worktree-destroy.sh — that handle the setup and cleanup. The browser-testing skill knows about this port assignment scheme and reads PORT from backend/.env to always navigate to the right place.
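The create script is essentially git worktree plus some port arithmetic. A simplified sketch of the idea follows; the slot numbering, base ports, and the .env variable names (other than PORT) are assumptions rather than the actual script:

```bash
#!/usr/bin/env bash
# worktree-create.sh <slot> <branch>  (simplified sketch)
set -euo pipefail

SLOT="$1"      # 1, 2, 3, ... one per parallel worktree
BRANCH="$2"
DIR="../$(basename "$PWD")-wt${SLOT}"

# Each worktree gets its own port range, offset by the slot number,
# so dev servers and emulators don't collide with main.
BACKEND_PORT=$((3000 + SLOT * 10))
AUTH_PORT=$((9099 + SLOT * 10))
FIRESTORE_PORT=$((8080 + SLOT * 10))
FIREBASE_UI_PORT=$((4000 + SLOT * 10))

# Check out the new branch into its own working directory.
git worktree add -b "$BRANCH" "$DIR"

# The browser-testing skill reads PORT from backend/.env,
# so record the assigned ports there (variable names illustrative).
cat >> "$DIR/backend/.env" <<EOF
PORT=$BACKEND_PORT
FIREBASE_AUTH_EMULATOR_PORT=$AUTH_PORT
FIRESTORE_EMULATOR_PORT=$FIRESTORE_PORT
FIREBASE_EMULATOR_UI_PORT=$FIREBASE_UI_PORT
EOF

echo "Worktree $DIR ready, backend on port $BACKEND_PORT"
```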
Codespaces and CI
The .github/workflows/copilot-setup-steps.yml workflow handles pre-installing all workspace dependencies for Codespaces so agents don't have to wait for npm ci to finish before they can start working. There's also a CI workflow that runs on every push and PR: typecheck, ESLint, Prettier format check, Vite build, etc.
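The setup-steps workflow itself is short. A minimal sketch of its shape, with the trigger and Node version as assumptions:

```yaml
# .github/workflows/copilot-setup-steps.yml  (simplified sketch)
name: Copilot setup steps
on: workflow_dispatch

jobs:
  copilot-setup-steps:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - name: Install workspace dependencies
        run: npm ci   # installs all three workspaces from the root lockfile
```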
What I've learned from this
The hardest part wasn't writing the agent prompts — it was figuring out what knowledge actually needs to be explicit. A lot of things that seem obvious to a developer who has been in the codebase for months are completely invisible to an agent with no context.
The skills are living documents. When an agent produces something wrong because it lacked context, I add that context to the relevant skill rather than just correcting the agent in-session. That way the knowledge accumulates.
The second thing: the Planner agent is worth its weight. It's easy to go straight to implementation, but having an agent that does nothing except produce a plan — and is not allowed to write code — creates a natural forcing function to think about sequencing and scope before anything changes.
The third thing: explicit operating modes matter more than I expected. "Default to Plan" means I almost never get surprised by Copilot making a change I didn't expect. The implementation step is always something I consciously triggered.
The setup takes some upfront time. But for a project with real users and real data, having a consistent quality bar that doesn't depend on remembering to check accessibility, or security, or whether a schema change needs a migration — that's worth the investment.