<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Invoke.dev]]></title><description><![CDATA[Exploring the intersection of Developer Experience and AI without the hype. ]]></description><link>https://read.invoke.dev</link><image><url>https://substackcdn.com/image/fetch/$s_!T0Ji!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67aa6b89-1133-4ca4-9f63-4faa9ece12ef_440x440.png</url><title>Invoke.dev</title><link>https://read.invoke.dev</link></image><generator>Substack</generator><lastBuildDate>Wed, 15 Apr 2026 07:54:47 GMT</lastBuildDate><atom:link href="https://read.invoke.dev/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Invoke.dev]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[invoke@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[invoke@substack.com]]></itunes:email><itunes:name><![CDATA[Alex Robinson]]></itunes:name></itunes:owner><itunes:author><![CDATA[Alex Robinson]]></itunes:author><googleplay:owner><![CDATA[invoke@substack.com]]></googleplay:owner><googleplay:email><![CDATA[invoke@substack.com]]></googleplay:email><googleplay:author><![CDATA[Alex Robinson]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[AI Mandates]]></title><description><![CDATA[AI mandates make honest feedback professionally dangerous, putting successful adoption at risk. 
What engineering leaders owe their teams when mandates arrive from above.]]></description><link>https://read.invoke.dev/p/ai-mandates</link><guid isPermaLink="false">https://read.invoke.dev/p/ai-mandates</guid><dc:creator><![CDATA[Alex Robinson]]></dc:creator><pubDate>Tue, 07 Apr 2026 13:32:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e6341d5a-6b8a-4981-a56a-b4574a373618_1344x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AI adoption metrics are improving inside organizations where honest feedback has become professionally dangerous. High usage numbers and favorable survey results are the predictable output of mandates tied to performance reviews. They tell you what people are willing to report, not what&#8217;s actually happening.</p><h2>Adoption That Can&#8217;t Be Questioned Isn&#8217;t Adoption</h2><p>AI mandates, formal requirements that engineers use specific tools, meet usage thresholds, or demonstrate AI integration in performance reviews, have become a standard corporate response to competitive pressure. Shopify, Duolingo, Coinbase, and Block are the names that made headlines, but the pattern runs far deeper. A <a href="https://hrtechedge.com/survey-finds-most-companies-now-require-employees-to-use-ai/?ref=invoke.dev">September 2025 survey</a> found 58% of companies now require some employees to use AI tools, with 24% mandating it across all roles.</p><p>The business case for these mandates relies on productivity numbers that don&#8217;t survive scrutiny. Vendor-funded studies report 55% faster task completion. The independent research lands differently. The <a href="https://dora.dev/research/2025/dora-report/?ref=invoke.dev">DORA 2025 report</a> analyzed telemetry from over 10,000 developers across 1,255 teams and found no significant correlation between AI adoption and organizational-level delivery outcomes. 
<a href="https://www.faros.ai/blog/ai-software-engineering?ref=invoke.dev">Faros AI found</a> that while individual developers completed 21% more tasks, PR review times increased 91% and PR sizes grew 154%. The bottleneck moved rather than disappeared.</p><p><a href="https://getdx.com/blog/ai-productivity-gains-are-10-percent-not-10x/?ref=invoke.dev">DX&#8217;s longitudinal study</a> across 400 companies found that AI usage increased by an average of 65% while PR throughput increased by just under 10%. That figure is consistent with what other independent research is finding as the realistic ceiling.</p><p>The gap between what mandates promise and what organizations experience is significant. What makes it worse is that the organizations least likely to discover this gap are the ones that have made honest feedback professionally dangerous.</p><h2>The Mechanism That Produces Silence</h2><p>When engineers raise concerns about AI adoption, specific things tend to happen. They get labeled as resistant to change. Their concerns get reframed as fear rather than technical judgment.</p><p>The most visible recent example is Block. Employees voiced concerns about AI mandates in a company-wide meeting. Within a month, <a href="https://www.cnn.com/2026/02/26/business/block-layoffs-ai-jack-dorsey?ref=invoke.dev">the company announced a 40% workforce reduction</a>. Jack Dorsey cited AI as the reason. Block&#8217;s stock rose 24%.</p><p>The lesson engineers draw from that sequence doesn&#8217;t need to be causally accurate to be real. Enough people observed the pattern for it to shape behavior. When speaking up about AI concerns produces visible professional consequences, people stop speaking up. Usage metrics improve while the underlying experience deteriorates. 
The organization goes blind to its own problems.</p><p>Companies that mandate AI and then blame it for layoffs are engaged in what researchers now call <a href="https://builtin.com/articles/ai-washing-layoffs?ref=invoke.dev">AI washing</a>. A <a href="https://builtin.com/articles/ai-washing-layoffs?ref=invoke.dev">Resume.org survey</a> found that 59% of hiring managers cite AI as the reason for cuts because it lands better with investors than admitting to financial pressure. It&#8217;s a better story than overhiring.</p><p>When AI is simultaneously mandated and the justification for headcount reductions, engineers are not being irrational when they treat questions about AI adoption as questions about their continued employment. They&#8217;re reading the room accurately.</p><h2>What Gets Lost</h2><p>The costs of silence-based adoption don&#8217;t show up in usage metrics.</p><p>The first cost is the diagnostic signal. DORA&#8217;s research is explicit: AI amplifies existing organizational strengths and weaknesses. Teams that can tell you where AI creates friction are the teams that could help you fix it. When those teams go quiet, the organizations most in need of course correction lose the feedback loops that would enable it. You keep optimizing a metric that stopped connecting to reality.</p><p>The second cost is the apprenticeship structure. <a href="https://arxiv.org/abs/2510.07435?ref=invoke.dev">Feng et al.</a>, in a peer-reviewed study of 442 developers, documented a shift that no productivity metric captures. Before mandate-driven adoption, junior engineers learned through pair programming, code reviews, and mentorship. Afterward, learning shifted to what participants described as private copy-paste. Senior engineers became reviewers of AI output rather than mentors who develop human judgment. 
<a href="https://www.gartner.com/en/articles/ai-lock-in?ref=invoke.dev">Gartner projects</a> that by 2030, half of enterprises will face irreversible skill shortages in at least two critical roles because of unchecked automation. That projection is the long-horizon consequence of a short-horizon decision being made right now, in thousands of engineering organizations, without anyone measuring it.</p><p>The third cost is senior talent. Research on developer burnout has established that autonomy is the single strongest protective factor against AI-related burnout, stronger than learning resources and stronger than positive attitudes toward AI. When mandates remove autonomy, prescribe tools, require usage metrics, and tie compliance to performance reviews, the engineers with the most options respond first. They don&#8217;t complain. They leave. The organizations that notice this are measuring retention alongside adoption. Most are not.</p><h2>The Structural Problem</h2><p>Psychological safety is the belief that you won&#8217;t be punished for speaking up with questions, concerns, or mistakes. An <a href="https://www.technologyreview.com/2025/12/16/1125899/creating-psychological-safety-in-the-ai-era/?ref=invoke.dev">MIT Technology Review survey</a> found 83% of executives believe it measurably improves AI initiative success. That same survey found organizations often promote a &#8220;safe to experiment&#8221; message publicly while deeper undercurrents work against it.</p><p>The message and the mechanism are different things.</p><p>The mechanism is organizational design, not communication. It&#8217;s what happens to the engineers who speak up. It&#8217;s whether usage metrics are tied to performance reviews. It&#8217;s whether skeptics get engaged or reassigned. It&#8217;s whether the engineering leader who reports that adoption is struggling gets support or pressure. 
These are structural questions, and they have structural answers.</p><p>AI mandates also land on workforces that often already have a calibrated read on what happens to people who push back. Prior layoffs, return-to-office pressure, and performance friction have already taken a toll on psychological safety before AI tools enter the conversation. The mandate doesn&#8217;t arrive in a vacuum. It arrives in an organizational history, and engineers read it through that history.</p><h2>What Engineering Leaders Can Do</h2><p>The engineering leader&#8217;s position is specific: you receive mandates from above and absorb the consequences below. You didn&#8217;t design the context, but you have discretion within it. That discretion is where responsible adoption actually lives.</p><p>Separate adoption metrics from performance evaluation. Tying AI usage to performance reviews guarantees that usage data is socially constructed rather than organizationally informative. If engineers believe their job security depends on demonstrating usage, their reported usage tells you nothing about genuine integration. For most engineering leaders, this is a design choice within their direct authority.</p><p>Instrument the system rather than the people. Telemetry on PR sizes, review times, deployment frequency, and change failure rate tells you what&#8217;s actually happening without asking people to self-report under pressure. If AI adoption is improving the system, you&#8217;ll see it. If it&#8217;s creating bottlenecks, you&#8217;ll see those too. The frame shifts from &#8220;are people using the tool&#8221; to &#8220;is the tool working,&#8221; which is a more honest question. The <a href="https://invoke.dev/ai-assisted-code-reviews/">PR review bottleneck</a> is often the first place this shows up.</p><p>Protect the apprenticeship loop deliberately. Define which problem types junior engineers solve without AI during their development period. 
Require that the developer who submitted the AI-generated code explain it before it is merged. <a href="https://invoke.dev/the-critical-thinking-paradox/">Skill erosion</a> is not hypothetical. It follows a documented pattern when developers stop practicing the reasoning that builds expertise. These aren&#8217;t anti-AI policies. They&#8217;re skill resilience policies, and naming them that way matters for how the team receives them.</p><p>Name the pressure your team is under. Your team already knows the mandate came from above. They know you&#8217;re operating within constraints you didn&#8217;t design. Be explicit about the organizational context while being clear about what you control within it. That candor builds trust, which is critical for effective adoption.</p><h2>The Measure That Matters</h2><p>Usage rate is not a measure of successful adoption. It&#8217;s a measure of compliance.</p><p>The actual measure is whether the people doing the integration can tell you honestly where it&#8217;s working, where it&#8217;s not, and what it&#8217;s costing them, and whether that honesty changes anything. Right now, in most engineering organizations, it doesn&#8217;t. Changing that is harder than deploying a tool and more consequential than any adoption metric. 
It&#8217;s also the part of this problem that only engineering leaders can solve, because it lives inside the teams they run, not in the mandates they receive.</p>]]></content:encoded></item><item><title><![CDATA[Total Cost of Ownership]]></title><description><![CDATA[Planning for the Real Cost of AI in Engineering]]></description><link>https://read.invoke.dev/p/total-cost-of-ownership</link><guid isPermaLink="false">https://read.invoke.dev/p/total-cost-of-ownership</guid><dc:creator><![CDATA[Alex Robinson]]></dc:creator><pubDate>Tue, 31 Mar 2026 11:51:55 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c4cecdde-9e9e-4a52-b3ec-f29b1d07d465_1344x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most ROI calculations for AI tooling share the same quiet flaw. They project productivity gains from research conducted on developers who use AI substantively every day, then price the investment at the tier where developers use it occasionally. 
The math looks compelling, but it isn&#8217;t realistic.</p><h1>Three Tiers</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X7xx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad65f7c-0aa6-4761-a4f3-e12ee4ebc389_1080x607.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X7xx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad65f7c-0aa6-4761-a4f3-e12ee4ebc389_1080x607.png 424w, https://substackcdn.com/image/fetch/$s_!X7xx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad65f7c-0aa6-4761-a4f3-e12ee4ebc389_1080x607.png 848w, https://substackcdn.com/image/fetch/$s_!X7xx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad65f7c-0aa6-4761-a4f3-e12ee4ebc389_1080x607.png 1272w, https://substackcdn.com/image/fetch/$s_!X7xx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad65f7c-0aa6-4761-a4f3-e12ee4ebc389_1080x607.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X7xx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad65f7c-0aa6-4761-a4f3-e12ee4ebc389_1080x607.png" width="1080" height="607" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ad65f7c-0aa6-4761-a4f3-e12ee4ebc389_1080x607.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:607,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:57999,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://read.invoke.dev/i/192674395?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad65f7c-0aa6-4761-a4f3-e12ee4ebc389_1080x607.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!X7xx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad65f7c-0aa6-4761-a4f3-e12ee4ebc389_1080x607.png 424w, https://substackcdn.com/image/fetch/$s_!X7xx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad65f7c-0aa6-4761-a4f3-e12ee4ebc389_1080x607.png 848w, https://substackcdn.com/image/fetch/$s_!X7xx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad65f7c-0aa6-4761-a4f3-e12ee4ebc389_1080x607.png 1272w, https://substackcdn.com/image/fetch/$s_!X7xx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad65f7c-0aa6-4761-a4f3-e12ee4ebc389_1080x607.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The AI coding tool market has sorted itself into three adoption tiers, each with different usage patterns and a different relationship to the productivity research teams cite when making the investment case.</p><p><strong>AI Tourists</strong> spend $10-20 per developer per month. Tourists use AI the way most of us use Stack Overflow: occasionally, when stuck, for quick lookups or autocomplete on familiar patterns. AI hasn&#8217;t changed how they work.</p><p><strong>AI Commuters</strong> spend $100-200 per developer per month. Commuters use AI across the full SDLC: planning, generation, refactoring, debugging, code review. This is what &#8220;we use AI for development&#8221; actually means when organizations report productivity gains.</p><p><strong>AI Natives</strong> spend $1,000 or more per developer per month. 
This tier runs full agentic workflows: multiple parallel agents executing multi-file tasks autonomously, agents that draft specs, generate code, run tests, and open pull requests without human prompting at each step.</p><p>The productivity research that justifies AI investment was conducted on substantive users. <a href="https://newsletter.getdx.com/p/why-arent-ai-productivity-gains-higher">Abi Noda</a>, whose firm DX has tracked AI adoption across hundreds of engineering organizations, puts it plainly: developers save about three hours per week when they&#8217;re actually using the tools. Developers in the heaviest usage quartile show measurable gains. Light users show minimal impact. A developer doing real agentic work throughout the day is a Commuter. That&#8217;s the realistic baseline for budgeting.</p><h1>Why Commuter Is the Realistic Baseline</h1><p>The Commuter range isn&#8217;t a power-user premium. It&#8217;s what a developer who uses agentic coding tools throughout a normal workday actually consumes.</p><p>Here&#8217;s why. Agentic coding tools don&#8217;t send a single prompt and wait. Each interaction is a multi-turn conversation: the system prompt, the full conversation history, the contents of files pulled into context, and tool-use tokens from file reads, searches, and test runs. A straightforward &#8220;refactor this function&#8221; request might load tens of thousands of tokens before the model generates a single line. A session working across multiple files on a mid-size codebase consumes more. As the day progresses, each follow-up request carries the accumulated history of everything before it. Context compounds.</p><p>The result is that a developer using agentic tools regularly throughout the day lands in the $100-200 per developer per month range. <a href="https://code.claude.com/docs/en/costs">Anthropic publishes an average daily cost figure</a> consistent with this range for API users. 
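</p><p>To make the compounding concrete, here is a back-of-envelope sketch. Every input in it (sessions per day, turns per session, context size, and the $3/$15 per-million-token rates) is an illustrative assumption, not a vendor-published figure:</p>

```python
# Back-of-envelope monthly cost of agentic coding for one developer.
# All inputs are illustrative assumptions, not vendor-published figures.

def monthly_cost(sessions_per_day, turns_per_session, context_tokens_per_turn,
                 output_tokens_per_turn, input_price_per_mtok,
                 output_price_per_mtok, workdays=21):
    """Estimate one developer's monthly spend.

    Each turn resends the accumulated history, so input tokens grow
    roughly linearly with the turn number within a session.
    """
    input_tokens = 0
    for turn in range(1, turns_per_session + 1):
        # Turn N carries roughly N times the base context: history compounds.
        input_tokens += turn * context_tokens_per_turn
    output_tokens = turns_per_session * output_tokens_per_turn
    session_cost = (input_tokens / 1e6) * input_price_per_mtok \
                 + (output_tokens / 1e6) * output_price_per_mtok
    return session_cost * sessions_per_day * workdays

# Two sessions a day, eight turns each, ~20k tokens of base context per turn,
# ~2k output tokens per turn, at hypothetical $3/$15 per million tokens:
print(monthly_cost(2, 8, 20_000, 2_000, 3.0, 15.0))
```

<p>At these assumed inputs the estimate lands near $100 per month, and because input tokens grow quadratically with session length, doubling the turns per session more than doubles the cost. That non-linearity is why sustained agentic use lands in Commuter territory.</p><p>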
That average reflects normal development work: planning a feature, iterating on implementation, debugging, reviewing diffs. The math works out roughly the same whether you&#8217;re on a flat-rate subscription that covers that usage or paying API rates directly.</p><p>For organizations deploying at scale, the structure matters. Team plans bundle access and usage into a per-seat fee. Enterprise plans separate the two: the seat fee covers access, and token consumption is metered on top at API rates. Either way, the underlying consumption is the same. A developer doing substantive agentic work costs roughly what the Commuter tier costs, regardless of how the invoice is structured.</p><h1>What You&#8217;re Actually Paying For</h1><p>Subscription pricing obscures a meaningful gap between what you pay and what you consume. Flat-rate plans are priced well below actual API cost at heavy usage levels. That gap represents a vendor subsidy designed to build habit and market share at below-cost pricing.</p><p><a href="https://www.vantage.sh/blog/cursor-pricing-explained">Cursor&#8217;s June 2025 switch</a> from request-based pricing to a credit model tied to actual API consumption was an early signal of where this is heading. The developer backlash was sharp enough that Cursor apologized and issued refunds. That reaction tells you something: developers had become dependent on flat-rate pricing, and the dependency became visible the moment it was threatened.</p><p>Commuter adoption also requires a stack, not a single tool. An IDE assistant, a code review tool, a chat model. At honest per-seat pricing, that&#8217;s $100-200 per developer per month in subscriptions alone before any overhead. A more honest planning assumption runs the ROI at 2-3x current subscription pricing. If the investment still works, proceed with confidence. 
If it only pencils at today&#8217;s subsidized rates, that&#8217;s a pricing dependency worth naming explicitly.</p><h1>The Hidden Costs</h1><p>Subscription pricing, even at honest Commuter rates, isn&#8217;t the full story. Five cost categories don&#8217;t appear on vendor pricing pages.</p><p><strong>Rework and quality debt.</strong> <a href="https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report">CodeRabbit&#8217;s analysis of 470 pull requests</a> found AI-generated code produces 1.7x more defects than human-written code, with security vulnerabilities appearing twice as often. <a href="https://leaddev.com/technical-direction/how-ai-generated-code-accelerates-technical-debt">GitClear&#8217;s analysis of 211 million lines of code</a> found code duplication surged eight-fold between 2022 and 2024. A Harness survey found 67% of developers now spend more time debugging AI-generated code than writing it themselves. The tool generating more defects is simultaneously increasing the time required to catch them.</p><p><strong>Review bottleneck.</strong> <a href="https://www.faros.ai/blog/key-takeaways-from-the-dora-report-2025">Agentic coding drives 98% more pull requests and 154% larger PRs</a>, according to Faros AI&#8217;s analysis of 10,000 developers, while review time jumps 91% and organizational delivery metrics stay flat. AI generates a 500-line PR at near-zero marginal cost. A senior engineer still needs two hours to evaluate it. Accelerating code generation without modernizing review doesn&#8217;t improve throughput. It moves the bottleneck downstream.</p><p><strong>Adoption ramp.</strong> Productivity gains require rebuilding the workflow around the tools. CI/CD pipelines need hardening to handle the volume. Teams need to develop and standardize a shared library of agent skills and workflows before individuals work efficiently. Best practices are still maturing, and that instability has a cost. 
What worked six months ago may already be obsolete.</p><p><strong>Governance and compliance.</strong> Any serious AI implementation requires governance: access controls, data handling policies, audit logging, and security review processes. For publicly traded companies and large organizations, that bar is higher still. <a href="https://www.veracode.com/blog/genai-code-security-report/">Veracode found that AI introduced security vulnerabilities in 45% of coding tasks</a>, because models optimize for tests passing rather than security properties. The tooling, training, and remediation work to meet that bar doesn&#8217;t appear on the vendor pricing page.</p><p><strong>Skill erosion.</strong> <a href="https://www.anthropic.com/research/AI-assistance-coding-skills">Anthropic&#8217;s January 2026 randomized study</a> found developers using AI scored 17% lower on code comprehension than those who coded manually, with debugging showing the largest gap. The capability that makes AI output valuable is the human judgment to evaluate it. A workforce that generates code quickly but can&#8217;t debug when AI fails, can&#8217;t understand system architecture, and can&#8217;t maintain code over the long term is a compounding liability, not a productivity asset.</p><h1>What Defensible Budgeting Looks Like</h1><p>Three moves make the calculation honest. First, budget at Commuter costs with 30-40% Year One overhead for the hidden categories. Use cases that still deliver positive ROI at that number are worth pursuing. Use cases that only pencil at Tourist pricing aren&#8217;t ready for workflow-level investment.</p><p>Second, stress-test the subsidy. Build a second scenario at 2-3x current subscription pricing. A decision that depends on a specific vendor pricing structure should acknowledge that dependency in the risk column, alongside technical risk.</p><p>Third, treat engineering depth as a strategic asset. 
The fundamentals that let developers evaluate AI output are the same capabilities that provide leverage when vendor pricing shifts. Minimum competency floors aren&#8217;t about slowing adoption. They preserve the auditing capability that makes AI output usable.</p><p>The ROI case for AI tooling works, at honest cost inputs, with realistic productivity estimates, for use cases that match the research conditions. The number is more modest than the vendor pitch and the cost is higher than the pricing page suggests. But there&#8217;s genuine value here for teams that measure it correctly.</p><p>I&#8217;m curious whether your team has done the full TCO version and where it lands. Does the investment hold at Commuter costs with the hidden categories included? That&#8217;s the conversation worth having before the budget is set.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.invoke.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Invoke.dev! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Treat Skills as Code]]></title><description><![CDATA[Agent skills are the most portable tool in agentic engineering. 
Time to give them the quality discipline they deserve.]]></description><link>https://read.invoke.dev/p/treat-skills-as-code</link><guid isPermaLink="false">https://read.invoke.dev/p/treat-skills-as-code</guid><dc:creator><![CDATA[Alex Robinson]]></dc:creator><pubDate>Tue, 17 Mar 2026 12:44:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5db64f30-6d95-4175-945c-8092d735ba7e_1344x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Agent skills have quietly become the most portable building block in agentic engineering. A single SKILL.md file works across Claude Code, Gemini CLI, Codex, and any agent that supports the <a href="https://agentskills.io/">open standard</a>. Write it once, use it everywhere. No vendor lock-in, no framework dependency, no API integration. Just a folder with markdown and optional scripts.</p><p>That portability is why skills matter more than most people realize. MCPs tie you to specific tool integrations. Custom agents require scaffolding. Skills are plain text that any model can read, and they carry your team&#8217;s institutional knowledge in a format that survives platform changes.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.invoke.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Invoke.dev! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>But there&#8217;s a problem. 
We treat skills like documentation when we should treat them like code.</p><h1>Skills Ship Without Tests</h1><p>Every other dependency in your stack has quality gates. Libraries have test suites. APIs have contracts. Infrastructure has validation. Skills have none of that. They&#8217;re markdown files that load into a context window and steer agent behavior with zero guarantees.</p><p>A broken skill doesn&#8217;t throw an error. It silently degrades output, triggers on the wrong task, or wastes tokens explaining things the model already knows. You won&#8217;t see a stack trace. You&#8217;ll see a slightly worse result that you might not even notice, because you have no baseline to compare against.</p><p>Most skill authors iterate by feel. Write the SKILL.md, test it manually a couple of times, ship it, move on. Did it actually trigger when it should? No idea. Did it improve output quality? Probably, maybe, it felt better. When models update (and they update constantly), did the skill stop working? You&#8217;ll find out when someone complains, or more likely, you won&#8217;t find out at all.</p><p>This is the measurement gap. Without a baseline, you can&#8217;t distinguish a skill that helps from one that wastes tokens. Without regression testing, model updates silently break skills that worked last month. Without structured evaluation, &#8220;it seems better&#8221; is the best feedback you&#8217;ll ever get. Improving from anecdotal evidence is like optimizing database queries without a profiler.</p><p>The skills ecosystem keeps growing. The <a href="https://tessl.io/registry">Tessl Registry</a> hosts over 2,000 public skills. Teams are building internal skill libraries shared across engineering organizations. Multiple agents consuming the same broken skill compounds the damage across every session. 
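</p><p>Some of these failure modes are mechanical enough to lint before any model-based eval runs. A minimal sketch follows; the specific checks and thresholds are illustrative heuristics rather than a standard, and it only handles single-line description fields:</p>

```python
# Minimal SKILL.md linter: mechanical checks on frontmatter and length.
# Checks and thresholds are illustrative heuristics, not a standard.
import re

def lint_skill(text):
    """Return a list of warnings for the content of a SKILL.md file."""
    warnings = []
    match = re.match(r"^---\n(.*?)\n---\n?(.*)", text, re.DOTALL)
    if not match:
        return ["missing YAML frontmatter block"]
    frontmatter, body = match.groups()
    desc = re.search(r"^description:\s*(.+)$", frontmatter, re.MULTILINE)
    if not desc or len(desc.group(1).strip()) < 40:
        warnings.append("description missing or too short to trigger reliably")
    if desc and "use when" not in desc.group(1).lower():
        warnings.append("description lacks an explicit 'Use when' clause")
    if len(body.splitlines()) > 500:
        warnings.append("body exceeds 500 lines; trim to cut token cost")
    return warnings

# A vague one-liner fails both description checks:
print(lint_skill("---\nname: api-helper\ndescription: Helps with API stuff\n---\nBody\n"))
```

<p>Running a check like this in CI on every skill change is cheap, and it catches the silent failures (vague triggers, bloated bodies) that never produce a stack trace.</p><p>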
Anthropic acknowledged the gap directly in their <a href="https://claude.com/blog/improving-skill-creator-test-measure-and-refine-agent-skills">March 2026 skill-creator update</a>, which added structured evals and benchmarking to the authoring workflow. The ecosystem needed quality infrastructure. It&#8217;s arriving.</p><h1>What Makes a Skill Good</h1><p>Before reaching for evaluation tools, you need to know what &#8220;good&#8221; looks like. These heuristics come from <a href="https://docs.claude.com/en/docs/agents-and-tools/agent-skills/best-practices">Anthropic&#8217;s best practices</a>, Tessl&#8217;s review scoring criteria, and findings from the <a href="https://arxiv.org/html/2602.12670v1">SkillsBench benchmark</a>. Applied manually, they catch the majority of issues.</p><h2>Trigger accurately</h2><p>The description field in your frontmatter is the only thing the agent reads before deciding whether to load the full skill. It carries disproportionate weight.</p><pre><code><code># Good: specific, third-person, includes trigger phrases
---
name: api-design
description: &gt;
  Designs REST APIs following team conventions for naming,
  versioning, and error handling. Use when creating new
  endpoints, reviewing API designs, or updating existing
  APIs to match current standards.
---

# Bad: vague, no trigger context
---
name: api-helper
description: Helps with API stuff
---</code></code></pre><p>Write descriptions in third person. Include an explicit &#8220;Use when...&#8221; clause. Test the description against near-miss queries, things that share keywords with your skill but shouldn&#8217;t trigger it. The negative test cases matter more than the positive ones. Anthropic&#8217;s documentation notes that Claude uses the description to choose the right skill from potentially 100+ available skills. If your description doesn&#8217;t include both what the skill does and when to use it, the skill is invisible.</p><h2>Stay lean</h2><p>Every token in your SKILL.md competes with conversation history, other skills, and the actual task. Anthropic&#8217;s guidance is blunt: Claude is already very smart. Only add context Claude doesn&#8217;t have.</p><p>The single highest-leverage improvement for any skill: read every paragraph and ask, &#8220;If I delete this, would the agent&#8217;s output actually get worse?&#8221; If the answer is probably not, delete it. Most skills are 40-60% longer than they need to be. Don&#8217;t explain what PDFs are before showing how to extract text from them. Don&#8217;t define REST conventions before describing your team&#8217;s specific API patterns. Encode what&#8217;s unique to your context: team conventions, architecture decisions, deployment procedures.</p><p>Keep SKILL.md under 500 lines. Under 300 is better.</p><h2>Use progressive disclosure</h2><p>Skills use a three-level loading system that controls token cost. At startup, only metadata (name and description, roughly 100 tokens) loads into context. The full SKILL.md loads only when the skill triggers. Referenced files load only when the SKILL.md explicitly directs the agent to read them.</p><pre><code><code>api-design/
&#9500;&#9472;&#9472; SKILL.md              &#8592; Loads on trigger (~200 lines)
&#9500;&#9472;&#9472; references/
&#9474;   &#9500;&#9472;&#9472; error-codes.md    &#8592; Loads only when needed
&#9474;   &#9492;&#9472;&#9472; naming-rules.md   &#8592; Loads only when needed
&#9492;&#9472;&#9472; scripts/
    &#9492;&#9472;&#9472; validate.py       &#8592; Executes, never loads into context</code></code></pre><p>Put detailed schemas, checklists, and examples in reference files. Never chain references more than one level deep. SKILL.md can point to <code>references/error-codes.md</code>, but that file should not reference another file. Agents fail on multi-hop chains.</p><h2>Match specificity to the need</h2><p>Not every instruction needs the same specificity. Match it to how fragile the task is.</p><p>Database migrations need exact scripts with no parameters. This requires high specificity, because variation is a bug. Code review needs heuristics and guidance, because the right feedback depends on context. Report generation sits in the middle: provide a template but let the agent adapt to the data.</p><p>For every instruction that uses MUST, ALWAYS, or NEVER, check whether you&#8217;ve explained why. &#8220;Filter test accounts because production reports with test data caused incorrect business decisions&#8221; works better than &#8220;ALWAYS filter test accounts.&#8221; Models respond to reasoning better than commands.</p><h2>Make it testable</h2><p>A skill you can&#8217;t evaluate is a skill you can&#8217;t improve. Design with verification in mind. Identify the failure modes you&#8217;d see without the skill. Define what correct output looks like for 3-5 realistic scenarios. Include concrete input/output examples in the skill itself, because models follow demonstrated patterns better than abstract instructions.</p><pre><code><code># In your eval definition
{
  "skill_name": "api-design",
  "evals": [
    {
      "id": 1,
      "prompt": "Design endpoints for user profile management",
      "expected_output": "REST endpoints following team naming 
        conventions with proper versioning and error handling",
      "assertions": [
        "Uses plural nouns for resource names",
        "Includes /v1/ path prefix",
        "Defines error response schema"
      ]
    }
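    # A hypothetical near-miss case belongs here too: a prompt that shares
    # keywords with the skill but should NOT trigger it, for example
    # "prompt": "Explain what REST stands for" with the assertion
    # "Does not apply team-specific endpoint conventions".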
  ]
}</code></code></pre><p>These heuristics work for manual review. But manual review doesn&#8217;t scale, doesn&#8217;t catch regressions across model updates, and doesn&#8217;t produce the data you need to improve systematically. That requires tooling.</p><h1>The Evaluation Landscape</h1><p>At least three solutions now exist for skill evaluation. Each approaches the problem from a different angle, and together they signal where the market is heading: skills require the same quality infrastructure as any other code dependency.</p><h2>Anthropic Skill-Creator 2.0</h2><p>Anthropic&#8217;s <a href="https://claude.com/blog/improving-skill-creator-test-measure-and-refine-agent-skills">March 2026 update</a> added structured evaluation directly into the skill authoring workflow. It runs parallel A/B tests with blind comparison (a comparator agent judges output quality without knowing which version produced it). It optimizes trigger descriptions using a train/holdout split to prevent overfitting. The entire workflow runs through natural language. No coding required.</p><p>The limitation is scope. Skill-creator works inside Claude&#8217;s ecosystem only. No cross-model testing, no CI/CD integration, no registry for distributing evaluated skills to a team.</p><h2>Promptfoo (OpenAI)</h2><p><a href="https://openai.com/index/openai-to-acquire-promptfoo/">OpenAI acquired Promptfoo on March 9, 2026</a> for its evaluation and red-teaming capabilities. Promptfoo&#8217;s open-source CLI lets you define test rubrics in YAML, run assertions against agent output, and scan for security vulnerabilities using OWASP and NIST presets. Over 350,000 developers use it, including teams at 25% of Fortune 500 companies.</p><p>Promptfoo is a general-purpose eval tool, not skill-specific. It has no structural review, no skill registry, and no built-in concept of baseline-vs-skill comparison. 
OpenAI committed to keeping it open source, but the deepest integration will favor the Frontier enterprise platform. It represents where OpenAI sees the evaluation layer heading: embedded in the platform, tightly coupled with agent deployment.</p><h2>Tessl</h2><p><a href="https://tessl.io/">Tessl</a> is the most vendor-agnostic option available at the moment. Built as a package manager for agent skills, it provides the full lifecycle: build, evaluate, distribute, optimize. Its evaluation stack has three layers.</p><p><code>tessl skill lint</code> validates structure and packaging. Does the frontmatter parse? Are required fields present? Think of it as a compilation check.</p><p><code>tessl skill review</code> scores your skill against best practices across three dimensions: validation (schema hygiene), implementation quality (clear instructions, concrete examples), and activation quality (discoverable triggers, specific keywords). The output includes actionable suggestions.</p><pre><code><code>$ tessl skill review ./api-design

Validation Checks
  &#10004; frontmatter_valid
  &#10004; name_field - 'api-design'
  &#10004; description_field - valid (186 chars)
  &#10004; description_voice - uses third person
  &#9888; description_trigger_hint - missing 'Use when...'
  &#10004; skill_md_line_count - 147 (&lt;= 500)

Implementation Score: 78%
Activation Score: 62%
Overall Score: 71%</code></code></pre><p><code>tessl eval run</code> measures behavioral impact. It generates realistic scenarios, runs the agent through each scenario with and without the skill, and scores outputs against criteria. The gap between baseline and skill-augmented scores tells you whether the skill changes behavior in the direction you want. If there&#8217;s barely any difference, the agent handled the task fine without your skill. If scores drop, the instructions hurt more than they help.</p><p>Tessl&#8217;s registry hosts over 2,000 evaluated skills. Evals run automatically when you publish, catching regressions before they reach users. Results are version-pinned, so you can compare v1.1.0 against v1.2.0 with data instead of intuition. It works across Claude Code, Gemini CLI, Codex, and any agent supporting the skill spec.</p><h2>Where each tool fits</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!V5bQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf1de0b1-0009-4fb7-8bb9-1f82d434f9cd_2320x1278.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!V5bQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf1de0b1-0009-4fb7-8bb9-1f82d434f9cd_2320x1278.png 424w, https://substackcdn.com/image/fetch/$s_!V5bQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf1de0b1-0009-4fb7-8bb9-1f82d434f9cd_2320x1278.png 848w, https://substackcdn.com/image/fetch/$s_!V5bQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf1de0b1-0009-4fb7-8bb9-1f82d434f9cd_2320x1278.png 1272w, 
https://substackcdn.com/image/fetch/$s_!V5bQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf1de0b1-0009-4fb7-8bb9-1f82d434f9cd_2320x1278.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!V5bQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf1de0b1-0009-4fb7-8bb9-1f82d434f9cd_2320x1278.png" width="1456" height="802" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf1de0b1-0009-4fb7-8bb9-1f82d434f9cd_2320x1278.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:802,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:417822,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://read.invoke.dev/i/191247779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf1de0b1-0009-4fb7-8bb9-1f82d434f9cd_2320x1278.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!V5bQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf1de0b1-0009-4fb7-8bb9-1f82d434f9cd_2320x1278.png 424w, https://substackcdn.com/image/fetch/$s_!V5bQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf1de0b1-0009-4fb7-8bb9-1f82d434f9cd_2320x1278.png 848w, 
https://substackcdn.com/image/fetch/$s_!V5bQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf1de0b1-0009-4fb7-8bb9-1f82d434f9cd_2320x1278.png 1272w, https://substackcdn.com/image/fetch/$s_!V5bQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf1de0b1-0009-4fb7-8bb9-1f82d434f9cd_2320x1278.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>These tools aren&#8217;t competing. 
They&#8217;re complementary, and together they represent the market converging on a single idea: skills need quality infrastructure. Anthropic built it into the authoring experience. OpenAI acquired it for the security layer. Tessl built the vendor-agnostic lifecycle platform.</p><p>The practical starting point: use Anthropic&#8217;s skill-creator for the authoring loop if you&#8217;re working in Claude. Use Tessl&#8217;s lint and review to catch structural issues and get a scored quality baseline that works across agents. The tools compose well together, and Tessl&#8217;s own documentation suggests using both.</p><h1>What This Means</h1><p>Skills are the most portable, most universal tool available for specializing AI agents. They work across models, across agents, across platforms. That&#8217;s rare in an ecosystem where most tools lock you into a specific vendor.</p><p>But portability without quality is just distributing problems faster. A bad skill deployed across a team&#8217;s agents doesn&#8217;t fail loudly. It fails quietly, across every session, degrading output in ways nobody measures because nobody has a baseline.</p><p>The evaluation infrastructure now exists. Anthropic, OpenAI, and Tessl each built their version in the first quarter of 2026. The question isn&#8217;t whether the tooling is ready. The question is whether we&#8217;ll apply the same discipline to skills that we apply to every other dependency in our stack.</p><p>What does your team&#8217;s skill quality process look like? Are you measuring impact, or still iterating by feel?</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.invoke.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Invoke.dev! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Context Engineering for AI-Assistant Development ]]></title><description><![CDATA[AI coding tools have a context problem.]]></description><link>https://read.invoke.dev/p/context-engineering-for-ai-assistant</link><guid isPermaLink="false">https://read.invoke.dev/p/context-engineering-for-ai-assistant</guid><dc:creator><![CDATA[Alex Robinson]]></dc:creator><pubDate>Sun, 08 Mar 2026 15:51:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1296d0e3-ed78-425b-9f50-c72187e50efb_1344x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AI coding tools have a context problem. Every token loaded into a context window competes for the model&#8217;s attention. <a href="https://arxiv.org/abs/2307.03172?ref=invoke.dev">Stanford and Berkeley research</a> documented the &#8220;lost-in-the-middle&#8221; phenomenon: LLMs perform best when relevant information is at the beginning or end of the input, with significant degradation when it is buried in the middle. Practitioners have found that models become unreliable when context usage exceeds <a href="https://www.youtube.com/watch?v=rmvDxxNubIg&amp;ref=invoke.dev">40-60% of maximum capacity</a>. A model claiming 200K tokens becomes unreliable around 130K.</p><p>So what we include matters. When we include it matters. How we organize it matters. 
<a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents?ref=invoke.dev">Anthropic frames the challenge clearly</a>: building effective AI agents is less about finding the right words and more about answering the question, &#8220;What configuration of context is most likely to generate the desired behavior?&#8221;</p><p>We&#8217;ve gone from one tool with one config file to three or four AI coding tools, each with its own conventions for context files, skill directories, and configuration formats. Claude Code reads <code>CLAUDE.md</code>. Gemini CLI reads <code>GEMINI.md</code>. Codex CLI reads <code>AGENTS.md</code>. Copilot wants <code>.github/copilot-instructions.md</code>. Maintain separate configs for each tool, and you&#8217;re managing the same knowledge in multiple places. Change a convention in one file, forget to update another, and your tools start contradicting each other.</p><p><a href="https://martinfowler.com/articles/exploring-gen-ai/context-engineering-coding-agents.html?ref=invoke.dev">Thoughtworks&#8217; analysis</a> of context engineering for coding agents found that the number of configuration options has exploded, with Claude Code leading innovations and other tools quickly following. The fragmentation creates a real maintenance burden.</p><p>Don&#8217;t pick one tool. Build a shared knowledge architecture that any tool can consume, with thin wrappers for tool-specific quirks. 
After a year working with Claude Code, Codex CLI, and Gemini CLI across a modular codebase, I&#8217;ve converged on a four-layer approach that separates concerns and keeps things tool-agnostic.</p><h2>A Layered Approach</h2><p>Two dimensions matter for understanding how context reaches the model.</p><p>The first is <strong>what decides to load it.</strong> <a href="https://martinfowler.com/articles/exploring-gen-ai/context-engineering-coding-agents.html?ref=invoke.dev">B&#246;ckeler&#8217;s analysis</a> identifies three triggers: the agent software automatically loads some context on every interaction (your AGENTS.md), the LLM decides to load other context when it judges it relevant (skills, MCP tools), and the developer explicitly triggers the rest (slash commands, manual file references). This distinction matters because it determines reliability. Agent-loaded context is deterministic. LLM-loaded context is probabilistic; the model might not activate a skill when you expect it to.</p><p>The second is <strong>whether execution is probabilistic or guaranteed.</strong> Most of this architecture is probabilistic guidance. Project knowledge and skills describe what the model should do, and it probably follows them. Probably. Hooks are different. They fire deterministically, executing shell commands at lifecycle boundaries regardless of the model&#8217;s decision. You can write &#8220;always run Prettier after editing files&#8221; in your AGENTS.md, and the model will follow it most of the time. 
A hook guarantees it.</p><p>These two dimensions are organized into four layers:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JuN-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a94f3c9-2a72-4b15-8dde-3a354fde852b_1076x1183.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JuN-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a94f3c9-2a72-4b15-8dde-3a354fde852b_1076x1183.png 424w, https://substackcdn.com/image/fetch/$s_!JuN-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a94f3c9-2a72-4b15-8dde-3a354fde852b_1076x1183.png 848w, https://substackcdn.com/image/fetch/$s_!JuN-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a94f3c9-2a72-4b15-8dde-3a354fde852b_1076x1183.png 1272w, https://substackcdn.com/image/fetch/$s_!JuN-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a94f3c9-2a72-4b15-8dde-3a354fde852b_1076x1183.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JuN-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a94f3c9-2a72-4b15-8dde-3a354fde852b_1076x1183.png" width="1076" height="1183" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6a94f3c9-2a72-4b15-8dde-3a354fde852b_1076x1183.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1183,&quot;width&quot;:1076,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!JuN-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a94f3c9-2a72-4b15-8dde-3a354fde852b_1076x1183.png 424w, https://substackcdn.com/image/fetch/$s_!JuN-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a94f3c9-2a72-4b15-8dde-3a354fde852b_1076x1183.png 848w, https://substackcdn.com/image/fetch/$s_!JuN-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a94f3c9-2a72-4b15-8dde-3a354fde852b_1076x1183.png 1272w, https://substackcdn.com/image/fetch/$s_!JuN-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a94f3c9-2a72-4b15-8dde-3a354fde852b_1076x1183.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The dependency direction flows upward. Project Knowledge orients the AI on every interaction. Tools provide raw capabilities. Hooks enforce deterministic behavior at lifecycle boundaries. Skills bundle domain knowledge with tool references and load on demand.</p><h2>Layer 1: Project Knowledge</h2><p>Project Knowledge answers &#8220;where am I and what matters here.&#8221; It describes the architecture, module boundaries, conventions, and domain concepts. Without it, an AI tool has capabilities but no orientation.</p><p><code>AGENTS.md</code> is the portable format: a markdown file at the repository root and optionally in module directories. The <a href="https://agents.md/?ref=invoke.dev">AGENTS.md specification</a> emerged in mid-2025 as a vendor-neutral standard, now stewarded by the Agentic AI Foundation under the Linux Foundation. 
It&#8217;s been <a href="https://cobusgreyling.medium.com/what-is-agents-md-2846b586b116?ref=invoke.dev">adopted by over 60,000 repositories</a> and is supported by Codex CLI, Cursor, GitHub Copilot, and others. Claude Code and Gemini CLI can be configured to read it.</p><p>The format cascades. An agent reads the root <code>AGENTS.md</code> first, then the local one for the current directory. This mirrors how <code>.gitignore</code> works: the closest file takes precedence. A root file describes the overall architecture and conventions. Module-level files describe internal structure, dependencies, and domain concepts specific to that module.</p><p>Keep it concise. There&#8217;s a practical constraint: <a href="https://www.aihero.dev/a-complete-guide-to-agents-md?ref=invoke.dev">frontier models can reliably follow roughly 150-200 instructions</a> before consistency degrades. Smaller models handle fewer. Every instruction in your AGENTS.md competes with every other instruction for the model&#8217;s attention. Community advice converges on <a href="https://docs.factory.ai/cli/configuration/agents-md?ref=invoke.dev">keeping AGENTS.md files under 150 lines</a>. A 500-line AGENTS.md with embedded examples, full API schemas, and copy-pasted documentation doesn&#8217;t make the model smarter. It makes it worse. Describe the module&#8217;s purpose, boundaries, and conventions, then point to skills for detailed patterns.</p><p>Where a tool requires its own context file, the tool-specific file should reference <code>AGENTS.md</code> and add only tool-specific instructions. One source of truth, thin wrappers.</p><pre><code><code># CLAUDE.md
See AGENTS.md for project architecture and module context.

## Tool-Specific Notes
- MCP servers available: Filesystem, GitHub
- Skills directory: `.agents/skills/`
</code></code></pre><p>For Gemini CLI, configure it to read AGENTS.md directly by adding <code>{"context":{"fileName":["AGENTS.md"]}}</code> to <code>.gemini/settings.json</code>. This eliminates the wrapper entirely.</p><h3>Path-Scoped Rules</h3><p>Some tools support path-specific conventions that fire when the AI encounters a particular file type. Think of them as file-type checklists: &#8220;when you see this kind of file, check these things.&#8221; These complement the broader project knowledge without duplicating it.</p><p>This is the least portable part of the architecture. Copilot uses <code>.github/instructions/*.md</code> with <code>applyTo</code> frontmatter. Claude Code uses <code>.claude/rules/</code>. Cursor uses <code>.cursor/rules/*.mdc</code>. Other CLI tools don&#8217;t support path-scoped rules at all. Author them in the format of your primary tool and accept this as tool-specific.</p><p>Rules are checklists, not architecture descriptions. They don&#8217;t duplicate what AGENTS.md covers. Each rule file stays under 20 lines and delegates to skills for detail. If a rule file starts describing module purpose, dependencies, and domain concepts, that content belongs in AGENTS.md.</p><pre><code><code>---
applyTo: ["**/*Controller.ts", "**/*Service.ts"]
---
# Service Files
- Single responsibility per service
- Dependencies injected via constructor
- Error handling with typed exceptions
- Service under 300 lines

&#8594; Detailed patterns: `.agents/skills/backend/`
</code></code></pre><h2>Layer 2: Tools</h2><p>Tools provide raw capabilities: querying systems, running builds, searching code, and linting files. They&#8217;re primitives that don&#8217;t know about project conventions. Skills compose them.</p><p>MCP is the standard transport for external tool connections. <a href="https://modelcontextprotocol.io/?ref=invoke.dev">Every major tool supports stdio-based MCP servers</a>. Build an MCP server once, and it works across Claude Code, Codex CLI, and Gemini CLI.</p><p>But MCP has a cost. Every connected MCP server loads its tool definitions into the context window upfront, whether you need them or not. <a href="https://www.anthropic.com/engineering/code-execution-with-mcp?ref=invoke.dev">Anthropic&#8217;s guidance on MCP efficiency</a> documents the problem: as connected tools grow, loading all definitions slows agents and increases costs. Connect several MCP servers, and you can burn 50,000+ tokens before the conversation starts. That&#8217;s context budget consumed by tool descriptions rather than project knowledge.</p><p>Be selective. Connect only the MCP servers that the current workflow requires. If a tool has deterministic I/O and doesn&#8217;t need runtime queries, a shell script is cheaper than an MCP server. Reserve MCP for external APIs, databases, and services that need dynamic interaction.</p><h2>Layer 3: Hooks</h2><p>Hooks are the only deterministic layer in this architecture. They&#8217;re shell commands that fire at specific lifecycle events regardless of what the model decides. A PostToolUse hook that runs your formatter after every file write doesn&#8217;t depend on the model remembering to format. 
It runs every time.</p><blockquote><p>As <a href="https://claude.com/blog/how-to-configure-hooks?ref=invoke.dev">Anthropic&#8217;s documentation</a> puts it: &#8220;Prompts are great for suggestions; hooks are guarantees.&#8221;</p></blockquote><p>Claude Code and Gemini CLI both support hooks with similar event models. Claude Code offers PreToolUse, PostToolUse, SessionStart, and Stop events. <a href="https://developers.googleblog.com/tailor-gemini-cli-to-your-workflow-with-hooks/?ref=invoke.dev">Gemini CLI added hooks in January 2026</a> with equivalent events: BeforeTool, AfterTool, BeforeAgent, AfterAgent. The hook scripts themselves are fully portable shell commands. Only the configuration wrapper differs.</p><p>The most practical hook patterns fall into three categories. Formatting and linting: run your code formatter after every file write so the model never produces code that violates style rules. Safety gates: block dangerous operations like <code>rm -rf</code>, writing to production configs, or committing secrets. Context injection: load recent git commits, open tickets, or environment state at session start so the model has fresh context without you pasting it manually.</p><pre><code><code>{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "npx prettier --write \"$(jq -r '.tool_input.file_path')\""
          }
        ]
      }
    ]
  }
}
</code></code></pre><p>One caution: hooks run with your full user permissions. There&#8217;s no sandbox. Treat hook scripts like production code. Validate inputs, use absolute paths, and be deliberate about what you allow.</p><h2>Layer 4: Skills</h2><p>Skills are where this architecture pays off. A skill bundles everything an AI needs to understand a domain and do work in it: reference documentation, checklists, templates, and the tool invocations required to complete actions.</p><p>The <a href="https://agentskills.io/specification?ref=invoke.dev">Agent Skills specification</a> was released as an open standard in late 2025 and has been adopted by <a href="https://thenewstack.io/agent-skills-anthropics-next-bid-to-define-ai-standards/?ref=invoke.dev">Microsoft, OpenAI, Atlassian, Figma, Cursor, and GitHub</a>. The format is deliberately simple: a directory containing a <code>SKILL.md</code> file with YAML frontmatter and markdown instructions, plus optional scripts and references.</p><p>Skills solve the context budget problem with progressive disclosure. At startup, the model sees just the skill&#8217;s metadata, roughly 100 tokens describing what it does. The full instructions load only when the model determines the skill is relevant. The <a href="https://agentskills.io/specification?ref=invoke.dev">specification recommends</a> keeping SKILL.md instructions under 5,000 tokens. Additional references load on demand, so smaller files mean less context consumption.</p><pre><code><code>skills/backend/
&#9500;&#9472;&#9472; SKILL.md              # What this skill knows and can do
&#9500;&#9472;&#9472; references/           # Patterns, examples, anti-patterns
&#9474;   &#9500;&#9472;&#9472; patterns.md       # Canonical patterns (single source of truth)
&#9474;   &#9492;&#9472;&#9472; anti-patterns.md  # What to avoid
&#9500;&#9472;&#9472; checklists/           # Structured criteria
&#9474;   &#9492;&#9472;&#9472; service-review.md # Applies references during review
&#9492;&#9472;&#9472; templates/            # Scaffolding
    &#9492;&#9472;&#9472; service.md        # Applies references during generation
</code></code></pre><p>Organize by domain, not by activity. A <code>react-review</code> skill and a <code>react-generation</code> skill would duplicate the same pattern knowledge. Instead, a single <code>react</code> skill contains all the patterns, with separate checklists for review and templates for generation.</p><blockquote><p>Token generation is probabilistic. Scripts are deterministic.</p></blockquote><p>Skills can also leverage scripts for deterministic tasks. When you need exact behavior, like validating against a checklist or applying a specific code transformation, the skill runs a script rather than asking the model to generate the logic each time.</p><h2>Making It Portable</h2><p>Author content in generic formats, store it in one location, and point tools to that location.</p><p>The obvious approach is to symlink from every tool-specific directory to one canonical location. This creates a problem. VS Code automatically scans <code>.github/skills/</code>, <code>.claude/skills/</code>, and <code>.agents/skills/</code>. If the same skill appears in multiple directories via symlinks, it shows up multiple times in the slash command menu. Duplicate skill names cause <code>.github/skills/</code> to take precedence, but the duplicates clutter the UI and waste the discovery budget.</p><p>The cleaner approach: pick one canonical directory and configure tools to read from it. <code>.agents/skills/</code> is the natural choice since it&#8217;s the vendor-neutral location that <a href="https://developers.openai.com/codex/skills/?ref=invoke.dev">Codex CLI scans natively</a>. For tools that don&#8217;t scan <code>.agents/skills/</code> by default, use a single symlink from their expected location. Avoid creating parallel paths that VS Code will discover independently.</p><pre><code><code>project/
&#9500;&#9472;&#9472; AGENTS.md                          &#8592; Portable context (all tools)
&#9500;&#9472;&#9472; CLAUDE.md                          &#8592; Imports AGENTS.md + Claude notes
&#9500;&#9472;&#9472; .mcp.json                          &#8592; MCP server config (Claude Code)
&#9500;&#9472;&#9472; .agents/
&#9474;   &#9492;&#9472;&#9472; skills/                        &#8592; Canonical skill location
&#9474;       &#9500;&#9472;&#9472; backend/
&#9474;       &#9500;&#9472;&#9472; frontend/
&#9474;       &#9492;&#9472;&#9472; testing/
&#9500;&#9472;&#9472; .claude/
&#9474;   &#9492;&#9472;&#9472; skills -&gt; ../.agents/skills    &#8592; Symlink (Claude Code)
&#9500;&#9472;&#9472; .gemini/
&#9474;   &#9500;&#9472;&#9472; skills -&gt; ../.agents/skills    &#8592; Symlink (Gemini CLI)
&#9474;   &#9492;&#9472;&#9472; settings.json                  &#8592; Reads AGENTS.md
&#9500;&#9472;&#9472; .github/
&#9474;   &#9492;&#9472;&#9472; instructions/                  &#8592; Path-scoped rules (Copilot)
&#9492;&#9472;&#9472; modules/
    &#9492;&#9472;&#9472; auth/
        &#9492;&#9472;&#9472; AGENTS.md                  &#8592; Module-specific knowledge
</code></code></pre><p><code>.github/skills/</code> is absent. VS Code already discovers <code>.agents/skills/</code> and <code>.claude/skills/</code>, so adding a third symlink in <code>.github/skills/</code> would triple every skill in the dropdown. If your primary tool is VS Code with Copilot, you could use <code>.github/skills/</code> as the canonical location and symlink from <code>.claude/</code> and <code>.agents/</code> instead. One real directory and symlinks where the tool can&#8217;t discover the canonical location.</p><p>Portability varies by layer. Project Knowledge and Skills are highly portable because AGENTS.md and SKILL.md are cross-tool standards. Tools are highly portable because MCP is the universal transport protocol. Hooks have moderate portability: Claude Code and Gemini CLI use similar event models, but Codex CLI doesn&#8217;t support hooks yet. Path-scoped rules have the lowest portability, with no universal standard across tools.</p><h2>Where This Breaks Down</h2><p>The four layers handle static context well: the project knowledge, conventions, and domain expertise that don&#8217;t change within a session. They don&#8217;t solve the harder problems that emerge during extended work.</p><p><strong>Long sessions and context rot.</strong> As a session accumulates conversation history, tool outputs, and intermediate results, the model&#8217;s effective working memory shrinks. <a href="https://research.trychroma.com/context-rot?ref=invoke.dev">Chroma Research</a> documented this: related-but-irrelevant content is worse than random noise because it looks relevant enough to distract. No amount of AGENTS.md tuning fixes a context window that&#8217;s 80% full of stale conversation.</p><p><strong>Trajectory poisoning.</strong> When a session goes wrong, errors become context for subsequent reasoning. The model builds on its own mistakes, and no amount of correction fixes a poisoned trajectory.
The context itself is the problem.</p><p><strong>The reasoning cliff.</strong> As tasks require more reasoning steps or involve more interacting variables, model accuracy doesn&#8217;t decline gradually. It collapses. <a href="https://machinelearning.apple.com/research/gsm-symbolic?ref=invoke.dev">Apple&#8217;s GSM-Symbolic research</a> found that adding a single irrelevant clause to a math problem led to performance drops of up to 65% across all models. You can&#8217;t fix a task that&#8217;s past the cliff by prompting better. You have to simplify the task itself.</p><p><strong>Cross-agent handoffs.</strong> Subagents help by providing a clean context window for each step. But information gets lost at the boundary. The summary a subagent returns is never as rich as the full context it operated in. Important nuances, edge cases discovered during the investigation, and contextual reasoning are all compressed into a few hundred tokens.</p><h2>What Practitioners Actually Do</h2><p>The teams getting consistent value from AI coding tools have developed instincts that complement the static architecture.</p><p><strong>Start fresh when quality degrades.</strong> The most practical response to trajectory poisoning is simple: commit your work, close the session, and start a new one with a clean context. Treat sessions like git branches. When one goes sideways, abandon it rather than trying to salvage the accumulated mess.</p><p><strong>Keep tasks small.</strong> <a href="https://www.augmentcode.com/guides/ai-coding-agents-for-spec-driven-development-automation?ref=invoke.dev">Multi-file benchmark data</a> shows 87% success on single-file tasks, dropping to 19% on multi-file infrastructure work. Limit each task to touching 2-3 files. If you can&#8217;t explain what the AI should do in a sentence or two, the task is too big. Split it.</p><p><strong>Use subagents for context isolation, not role-playing.</strong> Agents aren&#8217;t about personas. 
Their value lies in constraining which context loads at each step. A research subagent pulls documentation and API references. An implementation skill loads only the module being modified. Think of subagents as context boundaries.</p><p><strong>Plan before prompting.</strong> Write your own requirements. Make your own architecture decisions. Draft your own test cases. Then use the AI for implementation. As Osmani describes it, <a href="https://addyosmani.substack.com/?ref=invoke.dev">start by brainstorming a detailed specification</a> with the AI, outlining a step-by-step plan, and then writing code. The upfront investment feels slow but prevents wasted cycles when the model goes off-track.</p><h2>Getting Started</h2><p>Start simple. Create an <code>AGENTS.md</code> at your project root with the architecture summary, module map, and key conventions. Whenever you find yourself giving the AI the same instruction twice, add it to AGENTS.md instead.</p><p>Then identify your first skill. Pick the domain where your team spends the most time correcting AI output. If the AI keeps generating code that violates your architecture patterns, that&#8217;s your first skill. Encode the patterns, create a checklist, and point to example files in the codebase.</p><p>Set up the symlink structure early. Create <code>.agents/skills/</code> as your canonical location, then symlink from <code>.claude/skills/</code> and <code>.gemini/skills/</code>. One-time cost that pays off every time you add or modify a skill.</p><p>Add layers incrementally. Path-scoped rules when you notice file-type-specific mistakes. Hooks when you find yourself repeating &#8220;please run the formatter&#8221; in every session. Don&#8217;t build the full architecture before you&#8217;ve validated that the first two layers are working.</p><p>Monitor what&#8217;s actually reaching the model. Claude Code&#8217;s <code>/context</code> command shows what&#8217;s loaded and how much space each piece consumes. 
Gemini CLI&#8217;s <code>/stats</code> provides similar visibility. If your AGENTS.md and MCP tool definitions are consuming 40% of the context window before you&#8217;ve typed anything, the architecture needs trimming. The best context engineering is invisible when it works and diagnosable when it doesn&#8217;t.</p><p>Treat these files like code. Subject them to code review, version them, and update them when conventions change. Teams that get consistent value from AI tools invest in maintaining this context architecture alongside their codebase.</p><h2>Where the Standards Are Headed</h2><p>The standardization landscape is moving fast. AGENTS.md and the Agent Skills format have broad adoption, but both specifications are young. A <a href="https://github.com/agentsmd/agents.md/issues/135?ref=invoke.dev">v1.1 proposal for AGENTS.md</a> is working to codify implicit behaviors around discovery, layering, and precedence that the community has adopted but never documented. The community is still debating whether AGENTS.md should eventually be merged into README.md.</p><p>Some questions remain open. How detailed should module-level AGENTS.md files be before they become more noise than signal? When does a skill&#8217;s reference documentation grow large enough to need its own progressive disclosure? How do you measure whether a skill is actually improving AI output quality versus just adding tokens?</p><p>The fundamental tension remains: we&#8217;re building static architectures for tools whose hardest problems are dynamic. The four layers give you the best possible starting position for each session. What happens during the session still depends on task decomposition, session hygiene, and engineering judgment.</p><p>The tools improve constantly. The constraints will shift. But the core challenge won&#8217;t change: AI coding tools are only as good as the context in which they operate. 
Managing that context with intention, rather than hoping the model figures it out, is how you get consistent value from these tools.</p><p>I&#8217;m curious how others are handling the multi-tool fragmentation problem. Are you maintaining separate configs per tool, or converging on a shared architecture?</p>]]></content:encoded></item><item><title><![CDATA[Signal Check: February 20, 2026]]></title><description><![CDATA[This week's signal in AI and engineering]]></description><link>https://read.invoke.dev/p/signal-check-february-20-2026</link><guid isPermaLink="false">https://read.invoke.dev/p/signal-check-february-20-2026</guid><dc:creator><![CDATA[Alex Robinson]]></dc:creator><pubDate>Fri, 20 Feb 2026 13:02:35 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/85649138-0e1e-49b0-ae54-16fb6b9d2ab1_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>February 20, 2026</em></p><p>The psychological cost of AI adoption got a name this week, along with the data to back it up. Developers are using AI tools more than ever while trusting them less, and researchers are starting to explain why the gap keeps widening.</p><h2><a href="https://simonwillison.net/2026/Feb/15/deep-blue/">Deep Blue</a></h2><p><em>Simon Willison</em></p><p>Willison and the Oxide and Friends podcast coined a term for the existential dread software developers feel watching AI close in on their work: &#8220;Deep Blue.&#8221; Named after the chess machine. Not a doomer take or an optimist spin. A practitioner saying, &#8220;This feeling is real, and we should talk about it.&#8221;</p><h2><a href="https://margaretstorey.com/blog/2026/02/09/cognitive-debt/">How Generative and Agentic AI Shift Concern from Technical Debt to Cognitive Debt</a></h2><p><em>Margaret-Anne Storey</em></p><p>Technical debt lives in the code. Cognitive debt lives in developers&#8217; heads. Storey watched a student team hit a wall at week 8: they could no longer make simple changes because nobody could explain why the system was built the way it was. Shared understanding evaporated faster than the code quality. When agents write the code, the mental model for understanding it never forms.</p><h2><a href="https://stackoverflow.blog/2026/02/18/closing-the-developer-ai-trust-gap">Mind the Gap: Closing the AI Trust Gap for Developers</a></h2><p><em>Stack Overflow</em></p><p>AI usage rose to 84%. Trust dropped to 29%. Stack Overflow digs into why familiarity breeds skepticism rather than confidence. The core tension: developers are trained for deterministic thinking, and AI is probabilistic. Same question, two different answers, both plausible.
That variability violates foundational expectations about how tools should behave.</p>]]></content:encoded></item><item><title><![CDATA[AI-Assisted Reviews with GitHub Copilot]]></title><description><![CDATA[In my last article, I argued that engineering teams should fix the review process before scaling code generation.]]></description><link>https://read.invoke.dev/p/ai-assisted-reviews-with-github-copilot</link><guid isPermaLink="false">https://read.invoke.dev/p/ai-assisted-reviews-with-github-copilot</guid><dc:creator><![CDATA[Alex Robinson]]></dc:creator><pubDate>Tue, 17 Feb 2026 13:02:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/60f6cedb-8346-4f15-af9f-8dfd59e0eeed_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my <a href="https://read.invoke.dev/p/ai-assisted-code-reviews">last article</a>, I argued that engineering teams should fix the review process before scaling code generation. AI review tools won&#8217;t solve the capacity problem, but they handle what humans struggle to sustain: speed, consistency, and pattern recognition.</p><p>This is the practical follow-up.
Copilot code review has added path-specific instructions, agentic tool calling, a CLI code-review agent, and local IDE review since going GA. Most teams install it, leave the defaults, and wonder why it floods their PRs with noise. Configuration separates a tool the team ignores after a month from one that sticks.</p><h1>Instruction Files: Two Layers</h1><p>Copilot&#8217;s instruction system has two layers. Repository-wide instructions live in <code>.github/copilot-instructions.md</code> and apply to every review. Path-specific instructions live in <code>.github/instructions/</code> and target file patterns through YAML frontmatter.</p><p>Encode your universal standards in the repository-wide file. Write short, imperative directives rather than paragraphs.
Copilot processes &#8220;Flag hardcoded API keys or credentials&#8221; far more reliably than &#8220;Please be careful to look for any secrets that might have been accidentally committed.&#8221;</p><p><a href="https://github.blog/ai-and-ml/unlocking-the-full-power-of-copilot-code-review-master-your-instructions-files/">GitHub&#8217;s own guide</a> is explicit: vague directives like &#8220;be more accurate&#8221; add noise that confuses the LLM.</p><pre><code><code>&lt;!-- .github/copilot-instructions.md --&gt;

# Code Review Instructions

## Security
- Flag hardcoded API keys, tokens, or credentials
- Check that user input is validated before use
- Verify that sensitive data is not logged or exposed in error messages

## Error Handling
- Verify errors are handled, not silently ignored
- Check that error messages provide useful context for debugging
- Flag empty catch blocks

## Quality
- Flag functions longer than 40 lines
- Flag deeply nested logic (more than 3 levels)
- Check for missing nil/null guards on optional values

## Do Not Comment On
- Code formatting or style (handled by linters)
- Import ordering
- Trailing whitespace or line length</code></code></pre><p>Keep this file under 1,000 lines. The &#8220;Do Not Comment On&#8221; section matters. Every low-value comment competes for attention with the ones that count. If your linters already handle formatting, tell Copilot explicitly.</p><p>Path-specific instructions target parts of the codebase where generic rules fall short. Each file uses an <code>applyTo</code> frontmatter property with glob patterns. The <code>excludeAgent</code> property controls which Copilot agent reads which file, so you can run different rules for the code review agent and the coding agent.</p><pre><code><code>&lt;!-- .github/instructions/views.instructions.md --&gt;
---
applyTo: "**/Views/**/*.swift"
excludeAgent: "coding-agent"
---

# View Layer Review Standards

- Flag views exceeding 30 lines of body content
- Check that views do not make network calls directly
- Verify accessibility modifiers are present on interactive elements
- Flag any business logic in view code; it belongs in the view model
- Check that navigation is handled through the coordinator, not inline</code></code></pre><pre><code><code>&lt;!-- .github/instructions/tests.instructions.md --&gt;
---
applyTo: "**/Tests/**/*.swift"
---

# Test Review Standards

- Verify each test method tests a single behavior
- Check that test names describe the expected behavior
- Flag tests without assertions
- Flag any test that hits the network or file system without mocking
- Check that test data is created within the test, not shared across tests</code></code></pre><p>Organize by concern: views, networking, models, tests, security. <a href="https://github.blog/changelog/2025-11-12-copilot-code-review-and-coding-agent-now-support-agent-specific-instructions/">GitHub recommends</a> separating topics into distinct instruction files rather than cramming everything into one. A typical iOS project:</p><pre><code><code>.github/
&#9500;&#9472;&#9472; copilot-instructions.md          # Repository-wide standards
&#9492;&#9472;&#9472; instructions/
    &#9500;&#9472;&#9472; views.instructions.md         # View layer conventions
    &#9500;&#9472;&#9472; networking.instructions.md    # API and networking patterns
    &#9500;&#9472;&#9472; models.instructions.md        # Data model conventions
    &#9500;&#9472;&#9472; tests.instructions.md         # Testing standards
    &#9492;&#9472;&#9472; security.instructions.md      # Security-specific checks</code></code></pre><h1>Linters Handle Rules, AI Handles Judgment</h1><p>Don&#8217;t let Copilot duplicate what linters already catch. Linters enforce formatting, flag syntax errors, and check naming conventions deterministically. They&#8217;re fast, consistent, and produce zero false positives for well-defined rules.</p><p>Copilot&#8217;s strength is semantic analysis: logic correctness, edge cases, security vulnerabilities, and fuzzy checks that depend on context. Can a function name communicate its intent? Does the error handling account for this specific situation? Does this change duplicate logic from another module? Linters can&#8217;t make those calls.</p><p>The <a href="https://dev.to/techgirl1908/how-i-taught-github-copilot-code-review-to-think-like-a-maintainer-3l2c">Goose project maintainers</a> discovered this when they enabled Copilot code review and the other maintainers said the results were too noisy. The fix was telling Copilot exactly what the CI pipeline already covers:</p><pre><code><code>## CI Pipeline Context

Important: You review PRs before CI completes.
Do not flag issues that CI will catch.

### What Our CI Checks
- cargo fmt --check
- cargo test
- clippy lints
- npm run lint:check

## Skip These (Low Value)
Do not comment on:
- Style/formatting (handled by rustfmt, prettier)
- Clippy warnings
- Test failures
- Missing dependencies</code></code></pre><p>They also set a confidence threshold: &#8220;Only comment when you have HIGH CONFIDENCE (&gt;80%) that an issue exists. Be concise: one sentence per comment when possible.&#8221; That alone cut the noise dramatically.</p><p>The filtering pipeline looks like this: linters catch mechanical issues in CI, Copilot handles semantic analysis, human reviewers focus on architecture, business logic, and mentoring.</p><h1>Tuning: Start Minimal and Iterate</h1><p><a href="https://www.cubic.dev/blog/the-false-positive-problem-why-most-ai-code-reviewers-fail-and-how-cubic-solved-it">Research from Cubic</a> found that up to 40% of AI code review alerts get ignored. A developer who receives fifteen low-value comments on their first AI-reviewed PR will ignore comment sixteen, even if it&#8217;s the one that matters.</p><p>Start with five to ten rules that address your most common review feedback. Add rules one at a time and observe the results. If Copilot flags something your team doesn&#8217;t care about, add an explicit exclusion.</p><p>Show, don&#8217;t describe. Copilot is better at mimicry than interpretation. Telling it &#8220;prefer protocol-based dependency injection&#8221; may or may not flag violations. Show it a concrete example of the wrong approach alongside the right one, and accuracy improves noticeably.</p><pre><code><code>## Dependency Injection

Prefer protocol-based dependency injection over concrete types.

Bad:
class ProfileViewModel {
    let service = UserService()
}

Good:
class ProfileViewModel {
    let service: UserServiceProtocol
    init(service: UserServiceProtocol) {
        self.service = service
    }
}</code></code></pre><p>Watch for hallucinations. Copilot will invent concerns that don&#8217;t exist in the code, and vague instructions make this worse. The more specific your directives, the less room the model has to fabricate.</p><p>GitHub warns against including external links (Copilot won&#8217;t follow them) or requesting product behavior changes like blocking merges or altering comment formatting. Stick to what it can do: analyze code and leave comments.</p><h1>Review Before the PR Exists</h1><p>Most teams overlook Copilot&#8217;s ability to review locally, before code reaches a pull request.</p><p><strong>In the CLI</strong>, the <code>/review</code><a href="https://github.blog/changelog/2026-01-21-github-copilot-cli-plan-before-you-build-steer-as-you-go/"> slash command</a> analyzes staged or unstaged changes without leaving the terminal. Start an interactive <code>copilot</code> session in your project directory and run <code>/review</code>. Copilot delegates to a specialized <a href="https://github.blog/changelog/2026-01-14-github-copilot-cli-enhanced-agents-context-management-and-new-ways-to-install/">code-review agent</a> that focuses on surfacing genuine issues rather than style nitpicks. It reads the same <code>.github/copilot-instructions.md</code> from your repository.</p><pre><code><code># Start an interactive Copilot session in your project
$ copilot

# Review current changes (staged or unstaged)
&gt; /review

# Target a specific branch diff and focus area
&gt; /review Review changes in my current branch against main. Focus on security issues.</code></code></pre><p>The <code>/review</code> command runs inside an interactive session, not as a standalone flag. You can specify what to focus on. The code-review agent can also <a href="https://github.blog/changelog/2026-01-14-github-copilot-cli-enhanced-agents-context-management-and-new-ways-to-install/">run in parallel</a> with other specialized agents (Explore, Task, Plan), so a complex debugging session might analyze code, run tests, and review changes concurrently.</p><p><strong>In VS Code</strong>, open the Source Control view, hover over &#8220;Changes,&#8221; and click &#8220;Copilot Code Review.&#8221; Copilot reviews staged or unstaged changes and leaves inline comments using the same instruction files from your repository.</p><p><strong>In JetBrains</strong>, open the Commit tool window and select &#8220;Copilot: Review Code Changes&#8221; to get feedback before committing.</p><p>This shifts feedback earlier, while the code is still fresh in the developer&#8217;s mind. Cleaner PRs follow, and human reviewers stop wasting time on issues Copilot could have caught locally.</p><h1>Put Review Logic in Skills, Not Just Instructions</h1><p><a href="https://docs.github.com/en/copilot/concepts/agents/about-agent-skills">Agent Skills</a> go deeper than instruction files. Skills are folders containing instructions, scripts, and resources that Copilot loads when relevant. They work across VS Code, the CLI, and the coding agent. Where instruction files provide guidelines, skills enable specialized workflows with procedural knowledge and deterministic scripts.</p><p>There&#8217;s a practical reason to prefer skills over instruction files for the bulk of your review logic: portability. Instruction files in <code>.github/copilot-instructions.md</code> are GitHub-specific. They work with Copilot and nothing else. 
Skills follow the <a href="https://agentskills.io/home">agentskills.io</a> open standard, which means the same skill folder works across any tool that supports the format. If your team uses Claude Code alongside Copilot, or switches between Cursor and the CLI, review standards encoded as skills travel with you. Instruction files don&#8217;t.</p><p>Keep the repository-wide instruction file thin. Use it for Copilot-specific behavior: what to skip, confidence thresholds, response format. Move the substantive review logic, your team&#8217;s conventions, security checks, and architecture rules into skills that any agentic tool can consume.</p><p>You can get the best of both worlds by using <a href="https://github.blog/changelog/2025-09-03-copilot-code-review-path-scoped-custom-instruction-file-support/">path-specific instruction files</a> to point Copilot toward portable skills. Since instruction files and skills are both natural language that the LLM reads and follows, an instruction file can reference a skill by name or path, and Copilot will <a href="https://tiberriver256.github.io/ai%20and%20technology/skills-catalog-part-1-indexing-ai-context/">load and use it</a>. This gives you path-scoped triggering from instruction files with portable review logic in skills:</p><pre><code><code>&lt;!-- .github/instructions/swiftui-review.instructions.md --&gt;
---
applyTo: "**/Views/**/*.swift"
---
When reviewing SwiftUI views, use the swiftui-review skill
for team conventions and accessibility checks.
Flag any UIKit usage in SwiftUI view files.</code></code></pre><p>The instruction file stays thin: a few lines of path-specific context. The skill holds the detailed review logic and works across Copilot, Claude Code, and any other tool that supports the agentskills.io standard.</p><p>A code review skill for your team might encode the checks that matter most:</p><pre><code><code>code-review-skill/
&#9500;&#9472;&#9472; SKILL.md
&#9500;&#9472;&#9472; SECURITY.md
&#9492;&#9472;&#9472; scripts/
    &#9492;&#9472;&#9472; check_conventions.sh</code></code></pre><pre><code><code>&lt;!-- code-review-skill/SKILL.md --&gt;
---
name: team-code-review
description: &gt;
  Review code changes against team conventions and security standards.
  Use when asked to review code, PRs, or diffs.
---

# Code Review Skill

## Review Process
1. Run scripts/check_conventions.sh on changed files
2. Review SECURITY.md for security-specific checks
3. Evaluate logic correctness and edge case handling
4. Check for duplicated logic across the codebase

## What to Flag
- Functions exceeding cyclomatic complexity of 10
- Missing error handling on network calls
- Force unwraps in production code (test code is fine)
- Direct dependency on concrete implementations

## What to Skip
- Formatting issues (handled by SwiftLint)
- Import ordering
- Minor naming preferences</code></code></pre><p>The <code>scripts/check_conventions.sh</code> handles deterministic checks that don&#8217;t need an LLM. Token generation is probabilistic; scripts are deterministic. Use scripts for validation (lint checks, schema conformance), transformation (reformatting, boilerplate updates), and integration (API calls with specific auth requirements).</p><p>Skills use progressive disclosure. At startup, the model sees only the skill&#8217;s name and description from the YAML frontmatter, roughly 100 tokens. The full instructions load only when the skill is relevant. Additional files like <code>SECURITY.md</code> load on demand, keeping context lean until expertise is needed.</p><p>For Copilot specifically, store skills in the repository (<code>.github/skills/</code>) for your team, or in your home directory (<code>~/.copilot/skills/</code>) for personal use across projects. For broader use, keep skills in a shared repository that any tool can reference. They&#8217;re just folders, so they work with git and can be shared through whatever mechanism your team already uses.</p><h1>Automate PR Reviews</h1><p>Repository rulesets trigger Copilot review automatically. Go to Settings &gt; Rules &gt; Rulesets and create a new branch ruleset. Under &#8220;Branch rules,&#8221; select &#8220;Automatically request Copilot code review.&#8221; Two subsettings control the behavior:</p><p><strong>Review new pushes</strong> re-runs Copilot review when new commits land on the PR, so feedback stays current as the code evolves. Without this, Copilot only reviews once at PR creation.</p><p><strong>Review draft pull requests</strong> triggers reviews on drafts so authors can iterate with Copilot before requesting human review. 
This pairs well with local review: catch what you can locally, push a draft, let Copilot do a full pass, then mark ready for review.</p><p>Organization owners can apply rulesets across multiple repositories using pattern matching (<code>*feature</code> matches all repository names ending in &#8220;feature&#8221;). This rolls out Copilot review consistently without per-repo configuration.</p><p>Copilot code review also integrates <a href="https://docs.github.com/en/copilot/concepts/agents/code-review">CodeQL for security analysis</a> (enabled by default) and optionally ESLint and PMD. These tools run alongside the AI review, combining deterministic security scanning with probabilistic code analysis.</p><h1>Know If It&#8217;s Working</h1><p><a href="https://www.atlassian.com/blog/announcements/how-we-cut-pr-cycle-time-with-ai-code-reviews">Atlassian cut PR cycle time by 45%</a> by making Copilot the automated first reviewer on every PR. Their 18-hour average wait for first feedback dropped to minutes. New engineers merged their first PR five days faster.</p><p>But faster individual throughput doesn&#8217;t guarantee better team outcomes. The <a href="https://faros.ai/blog/key-takeaways-from-the-dora-report-2025">2025 DORA report</a> found that individual developers merged 98% more PRs while organizational delivery stability decreased 7.2%. 
<a href="https://www.cortex.io/post/ai-is-making-engineering-faster-but-not-better-state-of-ai-benchmark-2026">Cortex&#8217;s 2026 benchmark</a> found incidents per PR up 24% and change failure rates up 30% with AI adoption.</p><h2>What GitHub&#8217;s Dashboard Tells You</h2><p>GitHub&#8217;s <a href="https://docs.github.com/en/copilot/how-tos/administer-copilot/manage-for-enterprise/view-usage-and-adoption">Copilot usage metrics dashboard</a>, currently in public preview for Enterprise customers, tracks four categories: daily and weekly active users, code completion acceptance rates, chat interactions by mode (Ask, Edit, Agent), and agent adoption percentage. A separate <a href="https://docs.github.com/en/copilot/how-tos/administer-copilot/manage-for-enterprise/view-code-generation">code generation dashboard</a> breaks down lines of code changed by users versus agents, grouped by model and language.</p><p>The <a href="https://docs.github.com/en/rest/copilot/copilot-usage-metrics">usage metrics API</a> provides user-level granularity through JSON exports of <code>user_initiated_interaction_count</code>, <code>code_acceptance_activity_count</code>, and lines of code suggested versus accepted. Organization-level analytics arrived in December 2025. Team-level data is accessible through the <a href="https://docs.github.com/en/rest/copilot/copilot-metrics">Copilot Metrics API</a>. All of these derive from IDE telemetry, so users must have telemetry enabled to appear in reports.</p><p>This data answers adoption questions well. You can see who&#8217;s using Copilot, how often, in which IDEs, and with which models. You can spot teams with low adoption or users who interact frequently but rarely accept suggestions. For a rollout, these are useful leading indicators.</p><h2>What It Doesn&#8217;t Tell You</h2><p>None of GitHub&#8217;s built-in metrics track code review activity. 
No dashboard shows how many PRs Copilot reviewed, how many comments it left, how often authors accepted or dismissed suggestions, or the ratio of actionable feedback to noise. Nothing links Copilot usage data to PR workflow outcomes such as cycle time, change failure rate, or reviewer load distribution.</p><p>That&#8217;s a significant gap. These metrics measure whether people <em>use</em> Copilot, not whether it <em>helps</em>. Acceptance rate for code completions tells you suggestions are relevant. It says nothing about whether the code that ships is better or whether reviewers spend less time on routine checks.</p><h2>Measuring What Matters</h2><p>For code review, you&#8217;ll need to instrument your own measurements. GitHub&#8217;s <a href="https://docs.github.com/en/rest/pulls">Pull Request API</a> and <a href="https://docs.github.com/en/rest/pulls/reviews">Reviews API</a> provide the raw data. Track these before enabling AI review and compare after.</p><p><strong>PR cycle time.</strong> Time from PR creation to merge. This is the headline metric. If AI review works, the wait for first feedback drops and the overall cycle compresses.</p><p><strong>Reviewer load distribution.</strong> How many reviews each team member performs. AI review should flatten the curve, reducing the burden on the one or two senior engineers who currently review everything.</p><p><strong>Actionable comment rate.</strong> How many AI comments developers address versus dismiss. This is the best signal for instruction quality. If the team ignores most of Copilot&#8217;s feedback, the instructions need work, not the team.</p><p><strong>Change failure rate.</strong> Deployment failures or incidents tied to merged PRs. If this increases alongside faster cycle times, you&#8217;re trading quality for speed. The DORA and Cortex findings suggest this is the default outcome without deliberate quality gates.</p><p>Combine GitHub&#8217;s usage metrics API with your PR data for a fuller picture. 
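</p><p>Two of these, cycle time and reviewer load, fall out of a short script over Pull Request and Reviews API exports. A minimal sketch, not a definitive implementation: the <code>created_at</code> and <code>merged_at</code> fields match what the Pull Request API returns, while the <code>reviewers</code> list is a hypothetical field you would join in yourself from the Reviews API (fetching and auth are omitted):</p>

```python
from collections import Counter
from datetime import datetime
from statistics import median

def parse(ts):
    # GitHub timestamps are ISO 8601 with a trailing Z.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def cycle_time_hours(prs):
    # Hours from PR creation to merge; unmerged PRs are skipped.
    return [(parse(p["merged_at"]) - parse(p["created_at"])).total_seconds() / 3600
            for p in prs if p.get("merged_at")]

def reviewer_load(prs):
    # Reviews performed per person; a flatter distribution means
    # the senior-reviewer bottleneck is easing.
    return Counter(r for p in prs for r in p.get("reviewers", []))

# Toy records shaped like flattened API output.
prs = [
    {"created_at": "2026-02-01T09:00:00Z", "merged_at": "2026-02-02T09:00:00Z",
     "reviewers": ["dana", "lee"]},
    {"created_at": "2026-02-03T10:00:00Z", "merged_at": "2026-02-03T16:00:00Z",
     "reviewers": ["dana"]},
    {"created_at": "2026-02-04T08:00:00Z", "merged_at": "2026-02-06T08:00:00Z",
     "reviewers": ["dana", "sam"]},
]

print(median(cycle_time_hours(prs)))     # 24.0
print(reviewer_load(prs).most_common())  # [('dana', 3), ('lee', 1), ('sam', 1)]
```

<p>Run the same script on a window before and after enabling AI review; the deltas are the result of your experiment.</p><p>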
Correlating <code>code_acceptance_activity_count</code> with PR cycle time per user reveals whether developers who engage more with Copilot also ship faster, or just generate more code that sits in review.</p><p>Run a focused experiment. Pick a two-week sprint. Measure current cycle times and reviewer load. Enable Copilot review on routine code first, where a false positive costs little and the team can calibrate without pressure. If the numbers improve, expand. If they don&#8217;t, tune the instructions before scaling.</p><h1>Start Simple</h1><p>The instinct is to write exhaustive instruction files covering every convention your team has discussed. Resist it. Start with the repository-wide file and five to ten rules. Add path-specific instructions only after the base is stable. Introduce skills when workflows justify the complexity. Expand based on what Copilot gets wrong.</p><p>The teams getting consistent value from AI code review share one trait: they treat configuration as a living document, not a one-time setup. The bottleneck was never writing code. It was proving the code works.</p><div><hr></div><h1>Implementation Checklist</h1><h2>Baseline</h2><p>Measure current PR cycle time across the team, from creation to merge. Record the median and 90th percentile.</p><p>Identify your reviewer load distribution. Who reviews the most PRs? How does your top reviewer&#8217;s load compare to the team average?</p><p>Document your current change failure rate. Without a baseline, you won&#8217;t know whether AI review improves or degrades quality.</p><p>Confirm your linting and CI pipeline catches formatting, syntax, and known anti-patterns. AI review should never duplicate what deterministic tools already handle.</p><h2>Configure</h2><p>Create <code>.github/copilot-instructions.md</code> with five to ten rules targeting your most common review feedback. Keep it under 1,000 lines. 
Include a &#8220;Do Not Comment On&#8221; section for anything linters already cover.</p><p>Add path-specific instruction files in <code>.github/instructions/</code> only after the base file is stable. Organize by concern: views, networking, models, tests, security.</p><p>Include concrete code examples showing correct and incorrect patterns. Copilot is better at mimicry than interpretation.</p><p>Set a confidence threshold. Tell Copilot to comment only when confidence is high and to keep comments concise.</p><h2>Pilot</h2><p>Enable Copilot review on routine code first, where false positives cost little. Use repository rulesets to automate review on a subset of branches or repositories.</p><p>Run the pilot for at least two weeks. Collect feedback from PR authors and human reviewers on comment quality and noise.</p><p>Track the actionable comment rate. If the team ignores most feedback, revise the instructions before expanding.</p><p>Encourage local review in the IDE or CLI before pushing. Cleaner PRs mean less noise for both AI and human reviewers.</p><h2>Expand</h2><p>Once the pilot is stable, enable automatic review on draft PRs so authors iterate with AI feedback before requesting human review.</p><p>Add skills for workflows complex enough to justify them: security checks, architectural conventions, team-specific patterns. Store them in a shared repository so they work across Copilot, Claude Code, Cursor, and other agentic tools.</p><p>Apply rulesets across repositories using organization-level pattern matching.</p><p>Monitor GitHub&#8217;s usage metrics dashboard for adoption trends. Correlate <code>code_acceptance_activity_count</code> with your PR cycle time data to see whether engagement translates to faster delivery.</p><h2>Sustain</h2><p>Treat instruction files as living documents. When Copilot flags something your team ignores, add an exclusion. When it misses something important, add a directive.</p><p>Review metrics monthly against your baseline. 
If the change failure rate climbs alongside faster cycle times, tighten quality gates before scaling further.</p><p>Verify senior engineers spend less time on repetitive checks and more on architecture, business logic, and mentoring. That outcome justifies the investment.</p><p>Never commit code you can&#8217;t explain.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.invoke.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Invoke.dev! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Signal Check: February 13, 2026]]></title><description><![CDATA[This week's signal in AI and engineering]]></description><link>https://read.invoke.dev/p/signal-check-february-14-2026</link><guid isPermaLink="false">https://read.invoke.dev/p/signal-check-february-14-2026</guid><dc:creator><![CDATA[Alex Robinson]]></dc:creator><pubDate>Fri, 13 Feb 2026 13:15:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ce0e4604-0925-41d8-88fb-683a1dd0ed98_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><p>We&#8217;re two years into mainstream AI adoption and still figuring out the basics. How does it change workloads? Who&#8217;s accountable for agent-written code? 
This week brought a rigorous eight-month study, a legal analysis, and a practitioner reckoning that all point to the same conclusion: the tooling is ahead of our understanding of how to use it.</p><h2><a href="https://hbr.org/2026/02/ai-doesnt-reduce-work-it-intensifies-it">AI Doesn&#8217;t Reduce Work, It Intensifies It</a></h2><p><em>Aruna Ranganathan &amp; Xingqi Maggie Ye, Harvard Business Review</em></p><p>UC Berkeley tracked 200 employees for eight months. AI didn&#8217;t give time back. Workers expanded their scope, absorbed other people&#8217;s responsibilities, and stopped noticing where work ended. One engineer nailed it: &#8220;You had thought that maybe you could work less. But then, really, you don&#8217;t work less.&#8221; The Hacker News thread asked the obvious follow-up. If AI is so effective at reducing work, why are workloads increasing for companies that adopt it?</p><h2><a href="https://atmoio.substack.com/p/there-is-no-skill-in-ai-coding">There Is No Skill in AI Coding</a></h2><p><em>Mo Bitar</em></p><p>Bitar takes Karpathy&#8217;s honest assessment of agentic coding and turns it into a performance review. The comments are worth reading too. One reader compared AI to a leavening agent: too much and your bread implodes. Another pointed out that syntax memorization, the thing AI supposedly replaces, was already handled by LSPs and autocomplete years ago. The real work was never typing.</p><h2><a href="https://law.stanford.edu/2026/02/08/built-by-agents-tested-by-agents-trusted-by-whom/">Built by Agents, Tested by Agents, Trusted by Whom?</a></h2><p><em>Stanford Law School CodeX</em></p><p>Stanford&#8217;s legal team asks the question most engineering orgs are quietly avoiding. When agents optimize for &#8220;pass the tests&#8221; rather than &#8220;build good software,&#8221; who answers for it? The piece is blunt about the stakes. Reading and writing code has been the bedrock of this profession for seventy years. 
If that skill becomes optional, we should determine what replaces it as the basis for accountability.</p><h2><a href="https://x.com/karpathy/status/2019137879310836075">Karpathy: From &#8220;Vibe Coding&#8221; to &#8220;Agentic Engineering&#8221;</a></h2><p><em>Andrej Karpathy, via X</em></p><p>Karpathy called the original &#8220;vibe coding&#8221; post &#8220;a shower of thoughts throwaway tweet&#8221; and seemed genuinely surprised that it defined a movement. The new term is more deliberate. The community split predictably between those who see a new discipline forming and those who see a rebrand of common sense. What&#8217;s worth paying attention to: Karpathy still acknowledges models make &#8220;subtle conceptual errors,&#8221; overcomplicate code, and won&#8217;t manage their own confusion. That&#8217;s honest, coming from the person who started all of this.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.invoke.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Invoke.dev! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[AI-assisted Code Reviews]]></title><description><![CDATA[Improve the Review Bottleneck Before Scaling Code Generation]]></description><link>https://read.invoke.dev/p/ai-assisted-code-reviews</link><guid isPermaLink="false">https://read.invoke.dev/p/ai-assisted-code-reviews</guid><dc:creator><![CDATA[Alex Robinson]]></dc:creator><pubDate>Tue, 10 Feb 2026 13:03:11 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8e1bf704-4f9d-4d32-adbb-d3651f707ba0_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The Pull Request process has been a bottleneck for years. AI code generation didn&#8217;t create the problem, but it&#8217;s making it worse.</p><p>Before reaching for AI tools to increase code volume, fix the review process. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.invoke.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Invoke.dev! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><blockquote><p>There&#8217;s no point in generating code faster if it sits in a queue waiting for someone to look at it. That&#8217;s not a productivity gain. It&#8217;s a traffic jam.</p></blockquote><p>The symptoms are familiar to anyone who&#8217;s worked on a team larger than a handful of engineers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W9yx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3da2085-df7e-45c4-93e7-8c0903ac700d_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W9yx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3da2085-df7e-45c4-93e7-8c0903ac700d_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!W9yx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3da2085-df7e-45c4-93e7-8c0903ac700d_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!W9yx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3da2085-df7e-45c4-93e7-8c0903ac700d_1536x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!W9yx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3da2085-df7e-45c4-93e7-8c0903ac700d_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W9yx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3da2085-df7e-45c4-93e7-8c0903ac700d_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3da2085-df7e-45c4-93e7-8c0903ac700d_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:211691,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://read.invoke.dev/i/187297009?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3da2085-df7e-45c4-93e7-8c0903ac700d_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!W9yx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3da2085-df7e-45c4-93e7-8c0903ac700d_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!W9yx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3da2085-df7e-45c4-93e7-8c0903ac700d_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!W9yx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3da2085-df7e-45c4-93e7-8c0903ac700d_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!W9yx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3da2085-df7e-45c4-93e7-8c0903ac700d_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Wait times are the most visible problem. 
<a href="https://www.atlassian.com/blog/announcements/how-we-cut-pr-cycle-time-with-ai-code-reviews">Atlassian&#8217;s engineers</a> waited an average of 18 hours for a first review comment, with median PR-to-merge times over three days. A developer submits a PR, switches context to something else, and then has to reload everything when feedback finally arrives. Each switch <a href="https://jonnyzzz.com/blog/2026/01/16/code-review-bottleneck/">costs roughly 23 minutes</a> of refocusing time.</p><p>Review responsibility tends to concentrate among a few senior engineers. They know the codebase best, so they get tagged on everything. When <a href="https://shiftmag.dev/code-review-problems-and-fixes-5060/">only the tech lead</a> consistently reviews PRs, it creates a <a href="https://thenewstack.io/the-anatomy-of-slow-code-reviews/">bottleneck that stalls the entire team</a>. Senior engineers end up spending their time catching naming conventions and missing null checks rather than mentoring or solving harder problems.</p><p>Then there&#8217;s fatigue. <a href="https://smartbear.com/learn/code-review/best-practices-for-peer-code-review/">SmartBear&#8217;s study of 2,500 code reviews at Cisco Systems</a> found that defect detection drops sharply after 60 minutes of continuous review, and reviewing faster than 500 lines of code per hour causes a severe decline in effectiveness. When reviewers are exhausted, approvals become rubber stamps. The gate still exists, but it no longer catches problems.</p><p>AI coding tools increased the volume without increasing review capacity. <a href="https://www.qodo.ai/blog/best-ai-code-review-tools-2026/">GitHub&#8217;s Octoverse report</a> shows monthly code pushes crossing 82 million, with about 41% AI-assisted. The number of human reviewers hasn&#8217;t changed.</p><p>The code itself is harder to review. 
<a href="https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report">AI-authored PRs contain 10.83 issues on average</a> compared to 6.45 for human-written code. Security vulnerabilities appear 1.5-2x more often. And the code looks clean. AI output is polished enough to <a href="https://arxiv.org/html/2508.14727v1">hide subtle errors</a>, so every line needs scrutiny, and there are now more lines to review than ever before.</p><p>Increasing code volume in an already constrained review process doesn&#8217;t improve throughput. It degrades it.</p><p>AI code review tools won&#8217;t fix the fundamental capacity problem. But they address aspects of review that humans struggle to sustain: speed, consistency, and pattern recognition.</p><p>The biggest gain is eliminating the wait for first feedback. <a href="https://www.atlassian.com/blog/announcements/how-we-cut-pr-cycle-time-with-ai-code-reviews">Atlassian cut PR cycle time by 45%</a> by making AI the automated first reviewer on every PR. That 18-hour wait dropped to minutes. New engineers who used the AI reviewer merged their first PR five days faster. The code still needed human review, but authors could start fixing obvious problems immediately rather than sit idle.</p><p>Consistency is the less obvious but more durable gain. AI applies the same rules to every pull request. No bad days. No variation based on which reviewer picks up the PR. Traditional linters enforce formatting and reliably catch known anti-patterns; they should be the first layer of any review pipeline. But linters can&#8217;t evaluate whether a function name communicates intent, whether error handling is sufficient for the context, or whether a block of code duplicates logic from elsewhere. Custom lint rules are time-consuming to write and brittle to maintain. AI handles these fuzzy checks naturally. You describe the check in plain language, and it applies probabilistic judgment. 
Linters for deterministic rules, AI for pattern matching.</p><p>Think of it as triage. AI handles the first pass, catches common mistakes, and flags obvious security issues. By the time a senior engineer opens the PR, most of the noise is gone. They can focus on architecture, business logic, and mentoring.</p><blockquote><p>AI operates within a limited context. Maintainability, scalability, and architectural fit demand judgment that current models can&#8217;t provide. </p></blockquote><p>Senior engineers earn that title precisely because they connect code changes to business needs, architectural direction, and long-term system health. That connective reasoning is the review work that matters most. <a href="https://graphite.com/blog/ai-wont-replace-human-code-review">Graphite&#8217;s analysis</a> put it simply: AI can tell you code is inefficient, but only a senior developer can explain why a different approach works better for your specific system.</p><p>Even a perfect automated review that caught every defect would still fall short. Preventing bugs isn&#8217;t the only reason for a code review. A human must be accountable for every change that reaches production. No tool changes that.</p><p>Code review also serves purposes that don&#8217;t show up in cycle time metrics. It&#8217;s how senior engineers mentor juniors, how teams share context, and how better solutions surface through discussion.</p><p>The instinct when AI tools arrive is to prioritize code generation. That&#8217;s backward. The constraint was never writing code. It was proving that the code works.</p><p>The organizations seeing results with their AI adoption started with review, not generation. <a href="https://www.atlassian.com/blog/announcements/how-we-cut-pr-cycle-time-with-ai-code-reviews">Atlassian cut PR cycle time by 45%</a> by making AI the automated first reviewer on every PR. The biggest gain wasn&#8217;t catching more bugs. 
It was eliminating the wait for the first feedback.</p><p>But there&#8217;s a trap. Out of the box, most AI review tools produce more noise than signal. They flag style preferences, suggest unnecessary refactors, and generate comments that don&#8217;t warrant attention. A developer who gets fifteen low-value comments on their first AI-reviewed PR will ignore comment sixteen, even if it&#8217;s the one that matters. Fine-tuning the AI instructions determines whether it becomes part of the workflow or is quietly disabled after a month.</p><p>The sequencing matters. Measure your current review process. Know your cycle times, your reviewer load distribution, and your change failure rate. Introduce AI review on routine code first, where the cost of a false positive is low, and the team can calibrate without pressure. Validate that senior engineers are spending less time on repetitive checks and more on architecture, business logic, and mentoring. Then, once the review infrastructure can handle increased volume, scale code generation.</p><p>Generating code faster is easy. The hard part is building the review process that keeps up.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.invoke.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Invoke.dev! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Critical Thinking Paradox]]></title><description><![CDATA[AI coding assistants demand strong critical thinking skills while systematically undermining the development of these same skills. Adopt research-backed strategies for preserving expertise while using AI effectively.]]></description><link>https://read.invoke.dev/p/the-critical-thinking-paradox</link><guid isPermaLink="false">https://read.invoke.dev/p/the-critical-thinking-paradox</guid><dc:creator><![CDATA[Alex Robinson]]></dc:creator><pubDate>Mon, 02 Feb 2026 16:26:14 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/be8c0adb-adfa-4c90-9c3b-79886efdadbd_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AI coding tools have a problem. They require strong critical thinking to use effectively, yet they erode the very skills on which they depend. The better AI gets at writing code, the worse we become at evaluating it.</p><p>When developers delegate thinking to AI, they shift from solving problems to orchestrating code production. Developers move from &#8220;independent thinking, manual coding, iterative debugging&#8221; to &#8220;AI-assisted ideation, interactive programming, collaborative optimization.&#8221; Researchers call this <a href="https://arxiv.org/abs/2506.23253">material disengagement</a>. 
They selectively supervise rather than engage deeply.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.invoke.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Invoke.dev! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p><a href="https://arxiv.org/abs/2601.20245">Anthropic&#8217;s January 2026 study</a> provides the evidence. In a randomized trial with 52 software engineers learning a new Python library, participants using AI scored 17% lower on comprehension tests than those who coded manually, with debugging as the largest gap. Developers using AI couldn&#8217;t identify errors or explain why the code failed.</p><p>The speed gains weren&#8217;t statistically significant. Participants using AI finished approximately two minutes faster on average, but spent up to 30% of their time composing queries. For learning tasks, AI slows you down while reducing what you learn.</p><p>Developers get trapped in a vicious cycle. They submit incorrect AI-generated code, ask the AI to fix it, receive more incorrect code, and repeat the process. Only a minority <a href="https://arxiv.org/abs/2501.10091">analyzes what the AI produces</a>. Fixing the prompt feels easier than understanding the code.</p><p>Meanwhile, enterprises measure velocity rather than understanding. 
<a href="https://survey.stackoverflow.co/2025/">Stack Overflow&#8217;s 2025 survey</a> found that 84% of developers now use AI tools, yet only 33% trust their accuracy. The top source of frustration, cited by 66%, is solutions that are &#8220;almost right, but not quite.&#8221; We optimize for the wrong metrics. Output without learning. Speed without comprehension.</p><h1>Why Critical Thinking Matters</h1><p>AI-generated code looks correct. That&#8217;s the problem. It compiles, passes basic tests, and reads like something a competent developer wrote. But <a href="https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report">CodeRabbit&#8217;s analysis of 470 pull requests</a> found that AI-generated code produces 1.7x more issues than human-written code. Security vulnerabilities appear twice as often. Logic errors increase 75%. These aren&#8217;t obvious failures. They&#8217;re subtle defects that slip through review and surface in production.</p><p>AI doesn&#8217;t understand your codebase. It doesn&#8217;t know your architecture, business rules, or domain constraints. <a href="https://www.veracode.com/blog/ai-generated-code-security-risks/">Veracode&#8217;s 2025 report</a> found that AI introduced security vulnerabilities in 45% of coding tasks because models optimize for the shortest path to a passing result, not the safest one. The code works in isolation but ignores the system it&#8217;s joining. One development leader described it bluntly: &#8220;Very difficult to maintain, very difficult to understand, and in a production environment, none of that is suitable.&#8221;</p><p>The debt compounds. <a href="https://leaddev.com/technical-direction/how-ai-generated-code-accelerates-technical-debt">GitClear&#8217;s analysis of 211 million lines of code</a> found that code duplication surged 8-fold between 2022 and 2024. Copy-pasted lines now exceed refactored lines. 
MIT professor Armando Solar-Lezama called AI a &#8220;brand new credit card that is going to allow us to accumulate technical debt in ways we were never able to do before.&#8221; A <a href="https://www.sonarsource.com/blog/the-inevitable-rise-of-poor-code-quality-in-ai-accelerated-codebases/">Harness survey</a> found that 67% of developers now spend more time debugging AI-generated code than writing it themselves.</p><p>Critical thinking matters more now because the cost of skipping it is higher. Bad code from a junior developer triggers scrutiny. Confident-sounding code from AI bypasses it. The volume overwhelms reviewers. The plausibility defeats skepticism. Without deliberate evaluation, technical debt accumulates until the only path forward is a costly rewrite.</p><h1>Inverting the Workflow</h1><p>The current workflow runs backward. AI writes code and humans fix, review, and maintain it. This inverts learning and produces developers who can&#8217;t function when AI fails.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2im0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb7f79e-b561-48a5-965b-c36a22282bf9_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2im0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb7f79e-b561-48a5-965b-c36a22282bf9_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!2im0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb7f79e-b561-48a5-965b-c36a22282bf9_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!2im0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb7f79e-b561-48a5-965b-c36a22282bf9_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!2im0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb7f79e-b561-48a5-965b-c36a22282bf9_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2im0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb7f79e-b561-48a5-965b-c36a22282bf9_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fb7f79e-b561-48a5-965b-c36a22282bf9_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1266931,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://read.invoke.dev/i/186608209?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb7f79e-b561-48a5-965b-c36a22282bf9_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!2im0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb7f79e-b561-48a5-965b-c36a22282bf9_1536x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!2im0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb7f79e-b561-48a5-965b-c36a22282bf9_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!2im0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb7f79e-b561-48a5-965b-c36a22282bf9_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!2im0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb7f79e-b561-48a5-965b-c36a22282bf9_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Addy Osmani describes a different approach: &#8220;<a href="https://addyosmani.com/blog/ai-coding-workflow/">AI-augmented software engineering rather than AI-automated software engineering</a>.&#8221; The distinction matters. He uses AI aggressively while staying &#8220;proudly accountable for the software produced.&#8221;</p><p>The fix is simple in concept. Flip the workflow: humans plan and create, AI refactors and maintains, humans verify. Let AI handle the drudgery. Keep humans doing creative problem-solving that builds expertise.</p><h2>Think Before You Prompt</h2><p>Write your own requirements. Make your own architecture decisions. Draft your own test cases. Then use AI for validation: &#8220;Did you consider this option? Here are the trade-offs.&#8221; Get education, not answers.</p><p>Osmani&#8217;s workflow starts with brainstorming a detailed specification with the AI, then outlining a step-by-step plan, before writing any actual code. &#8220;This upfront investment might feel slow, but it pays off enormously.&#8221; He likens it to doing &#8220;waterfall in 15 minutes,&#8221; a rapid structured planning phase that makes subsequent coding smoother.</p><p>The key insight: if you ask the AI for too much at once, &#8220;it&#8217;s likely to get confused or produce a jumbled mess that&#8217;s hard to untangle.&#8221; Developers report AI-generated code that feels &#8220;like 10 devs worked on it without talking to each other.&#8221; The fix is to stop, back up, and split the problem into smaller pieces.</p><h2>Apply Systematic Review</h2><p><a href="https://github.blog/news-insights/research/does-github-copilot-improve-code-quality-heres-what-the-data-says/">GitHub&#8217;s research</a> found that developers using Copilot wrote 13.6% more lines of code with fewer readability errors, but only under rigorous human review. 
Use structured rubrics: inconsistent naming, unclear identifiers, excessive complexity, missing documentation, repeated code. Review is where learning happens.</p><p>Osmani treats every AI-generated snippet as if it were produced by a junior developer. &#8220;I read through the code, run it, and test it as needed. You absolutely have to test what it writes.&#8221; He sometimes spawns a second AI session to critique code produced by the first. &#8220;The key is to not skip the review just because an AI wrote the code. If anything, AI-written code needs extra scrutiny.&#8221;</p><h2>Set Intentional Boundaries</h2><p>A study found that participants who succeed <a href="https://arxiv.org/abs/2511.13996">limit AI&#8217;s contribution to approximately 30%</a>. Reserve creative and high-level decisions for yourself. Use AI for boilerplate, not core problem-solving. After each interaction, ask: &#8220;Did I learn anything?&#8221;</p><p>Anthropic&#8217;s research revealed distinct interaction patterns with dramatically different outcomes. The lowest-scoring developers delegated the work entirely to AI. The highest-scoring developers asked only conceptual questions while coding independently. They used AI for understanding, not production.</p><p>The pattern that combined speed with strong learning: &#8220;Generation-then-Comprehension.&#8221; Generate code, then ask follow-up questions to understand it. Alternatively, the &#8220;Hybrid Code-Explanation&#8221; approach: request code and explanations simultaneously. Slower, but you actually learn.</p><h2>Build Fast Feedback Loops</h2><p>Generate code during planning to see implications early. Test AI output immediately. Don&#8217;t accumulate tech debt. Treat AI drafts as conversation starters, not final products.</p><p>Osmani weaves testing into the workflow itself. 
&#8220;If I&#8217;m using a tool like Claude Code, I&#8217;ll instruct it to run the test suite after implementing a task, and have it debug failures if any occur.&#8221; This tight feedback loop (write code, run tests, fix) works because the AI excels at it, provided the tests exist.</p><p>Without tests, the agent might assume everything is fine when in reality it&#8217;s broken several things. So invest in tests. It amplifies the AI&#8217;s usefulness and your confidence in the result.</p><h2>Maintain Skills Deliberately</h2><p>Spend a few hours weekly on manual coding. Read documentation, not AI explanations. Talk to both AI optimists and skeptics. The goal isn&#8217;t rejecting AI. It&#8217;s remaining competent when AI fails.</p><p>Osmani addresses this directly: &#8220;For those worried that using AI might degrade their abilities, I&#8217;d argue the opposite, if done right. By reviewing AI code, I&#8217;ve been exposed to new idioms and solutions. By debugging AI mistakes, I&#8217;ve deepened my understanding.&#8221;</p><p>The catch is the &#8220;if done right&#8221; part. You must stay informed, actively reviewing and understanding everything. Otherwise, you&#8217;re just outsourcing your judgment to a statistical engine.</p><h1>The Stakes</h1><p>We&#8217;re creating a workforce that generates code quickly but cannot debug when AI fails, cannot understand system architecture, cannot maintain code over the long term, and cannot develop the expertise required to become senior engineers.</p><p>Here&#8217;s the uncomfortable truth from Anthropic&#8217;s research: <a href="https://arxiv.org/abs/2601.19062">developers rate disempowering interactions more favorably</a>. They actively seek complete solutions. They accept AI output without pushback. The disempowerment isn&#8217;t manipulation. It&#8217;s voluntary abdication.</p><p>But everything you&#8217;ve learned won&#8217;t be wasted. Problem-solving and decomposition matter forever. 
The question isn&#8217;t whether AI will keep improving. It will. The question is whether we improve alongside it.</p><p>Done right, AI handles the drudgery while we focus on creativity. It enables prototyping at the speed of thought. It frees mental energy for meaningful work. But only if we stay in the driver&#8217;s seat.</p><p>The developers who thrive won&#8217;t generate the most code or ship the fastest. They&#8217;ll be the ones who never stop thinking critically. They&#8217;ll treat AI as a powerful tool requiring careful thought, not a substitute for it.</p><p>Next time you reach for an AI coding tool, pause. Am I learning from this interaction, or outsourcing my thinking? Your answer determines whether AI amplifies or atrophies your abilities.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.invoke.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Invoke.dev! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[MCPs, Agents, Skills. 
Oh My!]]></title><description><![CDATA[Understanding the emerging ecosystem of LLM building blocks]]></description><link>https://read.invoke.dev/p/mcps-agents-skills-oh-my</link><guid isPermaLink="false">https://read.invoke.dev/p/mcps-agents-skills-oh-my</guid><dc:creator><![CDATA[Alex Robinson]]></dc:creator><pubDate>Tue, 27 Jan 2026 13:36:54 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4cec017c-d9b1-4dcf-8022-e7a389540989_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>LLMs are, without question, an impressive achievement at a scale few companies have the resources to pull off. However, when first introduced, LLMs had very limited practical applications. They&#8217;re unaware of the world after their training cutoff and unable to connect to tools that would let them accomplish real-world tasks. Ask a model to check your calendar, query a database, or search for recent news, and it can only apologize or hallucinate. The model has general intelligence but no hands to work with.</p><p>The model ecosystem has grown to include tools that allow the models to connect to the real world, moving beyond a static understanding. Most notably among these tools are MCPs, agents, and skills. Each of these tools solves a specific limitation, building toward AI tools that can actually get work done.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.invoke.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Invoke.dev! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>MCP: Connecting Models to the World</h1><p><a href="https://modelcontextprotocol.io/">Model Context Protocol </a>gives LLMs the ability to connect to external tools and data sources. Through MCP, a model can access your file system, query databases, call APIs, and search the web. The protocol standardizes these connections, enabling tool integrations to work across different models and platforms. Build an MCP server once, and it works with Claude, GPT, and any other model that supports the protocol.</p><p>This was a game-changer. MCP quickly became <a href="https://www.anthropic.com/news/model-context-protocol">a de facto standard</a> adopted across model providers and tools. Suddenly, AI assistants could do things rather than just discuss them.</p><p>But MCP brought its own problems. Every tool definition <a href="https://www.anthropic.com/engineering/code-execution-with-mcp">loads into the context window upfront</a>, whether you need it or not. Connect several MCP servers, and you might burn 50,000+ tokens before the conversation even starts. This exacerbates issues with context windows, leaving less room for the actual work and accelerating context rot, the degradation in model performance as token counts climb.</p><h1>Agents: Isolating Context</h1><p>Agents address the context window pressure by allowing subprocesses to work with isolated context windows. When a lead agent spawns a subagent, that worker gets a fresh context, separate from the main conversation. The subagent might consume 100,000 tokens investigating a problem, but returns only a condensed summary. 
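That handoff can be sketched in a few lines of Python. This is a hypothetical illustration of the pattern only: `call_model` is a stub standing in for whatever LLM API you use, and `run_subagent` is an invented name, not part of any real agent SDK.

```python
# Sketch of context isolation between a lead agent and a subagent.
# All names here are hypothetical stand-ins, not a real agent framework.

def call_model(messages: list[dict]) -> str:
    """Stand-in for a model call; a real version would hit an LLM API."""
    return f"condensed summary of {len(messages)} message(s)"

def run_subagent(task: str) -> str:
    # The subagent gets a fresh context, separate from the lead agent's.
    sub_context = [{"role": "user", "content": task}]
    # ...in practice, many tool calls and intermediate results accumulate
    # here, potentially consuming 100,000+ tokens...
    result = call_model(sub_context)
    # Only the condensed result crosses back to the lead agent.
    return result

lead_context = [{"role": "user", "content": "Investigate the failing build"}]
summary = run_subagent("diagnose the flaky integration test")
lead_context.append({"role": "assistant", "content": summary})
# The lead context grows by one short message, not the subagent's transcript.
```

The shape is the same regardless of framework: delegate, let the worker burn its own tokens, and merge back only the distilled result.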
The main thread stays clean.</p><p>This alleviates much of the pressure from context rot, but agents remain general-purpose rather than specialized. An agent knows how to reason and use tools, but it doesn&#8217;t know your company&#8217;s specific processes, conventions, or requirements. Every agent needs the same detailed instructions repeated, and there&#8217;s no good way to share that expertise across your team.</p><h1>Skills: Lightweight Specialization</h1><p>This is where skills come into play. <a href="https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills">Skills</a> package procedural knowledge, the &#8220;how we do things here&#8221; that agents lack, into lightweight, reusable modules. A skill is just a folder containing instructions, workflows, and optional scripts.</p><pre><code><code>code-review-skill/
&#9500;&#9472;&#9472; SKILL.md           # Main instructions with metadata
&#9500;&#9472;&#9472; CHECKLIST.md       # Review criteria for your team
&#9492;&#9472;&#9472; scripts/
    &#9492;&#9472;&#9472; lint_check.py  # Deterministic validation</code></code></pre><p>Skills are only fully loaded on demand. At startup, the model sees just the skill&#8217;s metadata, roughly 100 tokens describing what it does. The full instructions load only when the model determines the skill is relevant. This progressive disclosure keeps the context window lean until expertise is actually needed.</p><p>Skills can also leverage scripts for deterministic tasks, making them more efficient than general reasoning. When you need exact behavior, like applying a specific code transformation or validating against a checklist, the skill runs a script rather than asking the model to generate the logic each time. Token generation is probabilistic; scripts are deterministic.</p><p>Skills can be used in combination with MCPs and agents, but they provide significant value on their own. You don&#8217;t need a complex multi-agent setup to benefit. A single model with the right skills can handle sophisticated workflows that would otherwise require elaborate prompting or custom tooling.</p><h1>Deep Dive: Working with Skills</h1><p>Let&#8217;s look at how to derive practical value from skills in engineering workflows.</p><h2>What Makes a Good Skill</h2><p>The best skills encode knowledge that would take an outsider weeks to absorb. Your team&#8217;s code review standards, your deployment procedures, your API design conventions. This institutional knowledge typically lives in wikis nobody reads or in the heads of senior engineers. Skills make it actionable.</p><p>A skill should have a clear, narrow purpose. &#8220;Code review&#8221; is better than &#8220;help with development.&#8221; The model needs to know when to activate the skill, and vague descriptions lead to false matches or missed opportunities.</p><p>Include enough context that the skill works without additional explanation. 
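For concreteness, the `lint_check.py` referenced in the layout above might contain a handful of deterministic checks. This is a hypothetical sketch; the rules, threshold, and file name are invented for illustration, not a prescribed skill format.

```python
# Hypothetical sketch of a scripts/lint_check.py for a code review skill:
# deterministic checks the model can execute rather than re-derive each time.
import re
import sys
from pathlib import Path

MAX_LINE_LENGTH = 100  # invented team convention for this example
TODO_PATTERN = re.compile(r"\b(TODO|FIXME)\b")

def check_file(path: Path) -> list[str]:
    """Return a list of human-readable violations for one source file."""
    problems = []
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        if len(line) > MAX_LINE_LENGTH:
            problems.append(f"{path}:{lineno}: line exceeds {MAX_LINE_LENGTH} chars")
        if TODO_PATTERN.search(line):
            problems.append(f"{path}:{lineno}: unresolved TODO/FIXME")
    return problems

if __name__ == "__main__":
    # Non-zero exit gives the model an unambiguous pass/fail signal.
    violations = [p for arg in sys.argv[1:] for p in check_file(Path(arg))]
    print("\n".join(violations) or "OK")
    sys.exit(1 if violations else 0)
```

Because the script either passes or fails, the model gets an exact signal instead of a probabilistic judgment, which is the point of putting such checks in a script rather than in prose instructions.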
If your code review skill references your team&#8217;s error handling patterns, include those patterns in the skill rather than assuming the model knows them.</p><h2>Structure and Organization</h2><p>Every skill starts with a SKILL.md file containing YAML frontmatter and markdown instructions:</p><pre><code><code>---
name: api-design
description: Design REST APIs following team conventions
version: 1.0.0
---

# API Design Skill

## Conventions
- Use plural nouns for resource names
- Version APIs in the URL path (/v1/resources)
- Return 201 for successful creation, 200 for updates
...</code></code></pre><p>Additional markdown files can provide specialized guidance. A code review skill might use separate files for security, performance, and style reviews. The model loads these as needed based on the task.</p><p>Scripts go in a scripts directory. These handle operations where deterministic execution matters more than flexibility. A skill for database migrations might include a script that validates migration files against your schema conventions before the model even looks at the content.</p><pre><code><code>code-review-skill/
&#9500;&#9472;&#9472; SKILL.md &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472; Instructions + YAML metadata
&#9474;                        (loaded when skill activates)
&#9500;&#9472;&#9472; CHECKLIST.md &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472; Team-specific review criteria
&#9474;                        (loaded when needed)
&#9500;&#9472;&#9472; SECURITY.md &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472; Security-focused guidance
&#9474;                        (loaded for security reviews)
&#9492;&#9472;&#9472; scripts/
    &#9492;&#9472;&#9472; lint_check.py &#9472;&#9472; Deterministic validation
                     &#9; (executed, not generated)</code></code></pre><h2>Where Scripts Add Value</h2><p>Scripts shine for three types of tasks:</p><p><strong>Validation</strong>: Check that the generated code meets structural requirements before presenting it. Lint checks, schema validation, and format verification.</p><p><strong>Transformation</strong>: Apply consistent changes that would be tedious to describe in natural language. Reformatting imports, updating boilerplate, applying code style rules.</p><p><strong>Integration</strong>: Connect to external systems where precise interactions are required. API calls with specific authentication, database queries with exact syntax, and file operations with particular permissions.</p><p>The model handles reasoning and judgment. Scripts handle precision and reliability.</p><h2>Sharing and Versioning</h2><p>Skills are just folders, so they work with your existing tools. Store them in Git for version control. Share them through Google Drive or your team&#8217;s documentation system. The <a href="https://agentskills.io/home">agentskills.io</a> open standard means skills you create work across platforms that support the format.</p><p>For team adoption, start with skills that codify processes you&#8217;re already documenting. Onboarding checklists, incident response procedures, and release processes. These have clear value, and the content already exists in some form.</p><h2>Measuring Impact</h2><p>Track token consumption before and after skill adoption. The theoretical gains are significant, skills loading ~5,000 tokens versus MCP tools loading 60,000+, but your specific workflows will vary.</p><p>More importantly, track whether the model&#8217;s outputs improve. Are code reviews catching the issues your team cares about? Are the generated APIs following your conventions? 
Skills should produce noticeably better results for your specific context, not just save tokens.</p><h1>Putting It Together</h1><p>The full stack works like this:</p><ul><li><p><strong>LLM</strong>: Provides reasoning and generation capabilities</p></li><li><p><strong>MCP</strong>: Connects the model to external tools and data sources</p></li><li><p><strong>Agents</strong>: Enable parallel processing with isolated context windows</p></li><li><p><strong>Skills</strong>: Supply domain expertise and deterministic procedures</p></li></ul><p>You don&#8217;t need all layers for every workflow. A model with skills but no MCP connections can still provide substantial value for tasks that don&#8217;t require access to external tools. Skills layered on MCP give you both connectivity and expertise. Agents add parallel processing when tasks genuinely benefit from isolation.</p><p>For most engineering teams, skills offer the highest return for the lowest complexity. Start there. Add agents when you have workflows that clearly need parallel, isolated execution. Expand MCP connections as you identify tools the model needs to access.</p><h1>Looking Ahead</h1><p>In the course of a year, we&#8217;ve seen MCP, agents, and skills emerge as standards for getting more out of LLMs. Each solves a real limitation: MCP connects models to tools, agents manage context isolation, and skills provide lightweight specialization.</p><p>It will be interesting to see what this year brings.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.invoke.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Invoke.dev! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Avoiding the AI Complexity Cliff]]></title><description><![CDATA[Managing scope in AI-assisted development]]></description><link>https://read.invoke.dev/p/avoiding-the-ai-complexity-cliff</link><guid isPermaLink="false">https://read.invoke.dev/p/avoiding-the-ai-complexity-cliff</guid><dc:creator><![CDATA[Alex Robinson]]></dc:creator><pubDate>Tue, 20 Jan 2026 13:36:34 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a9e0fa20-1228-4147-9786-574ca85c3371_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5STM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86cb51-fba1-4a24-ba9a-a1f4b3f494fc_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5STM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86cb51-fba1-4a24-ba9a-a1f4b3f494fc_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!5STM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86cb51-fba1-4a24-ba9a-a1f4b3f494fc_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!5STM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86cb51-fba1-4a24-ba9a-a1f4b3f494fc_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!5STM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86cb51-fba1-4a24-ba9a-a1f4b3f494fc_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5STM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86cb51-fba1-4a24-ba9a-a1f4b3f494fc_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e86cb51-fba1-4a24-ba9a-a1f4b3f494fc_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;content.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="content.png" title="content.png" srcset="https://substackcdn.com/image/fetch/$s_!5STM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86cb51-fba1-4a24-ba9a-a1f4b3f494fc_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!5STM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86cb51-fba1-4a24-ba9a-a1f4b3f494fc_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!5STM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86cb51-fba1-4a24-ba9a-a1f4b3f494fc_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!5STM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86cb51-fba1-4a24-ba9a-a1f4b3f494fc_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In AI-assisted development, LLMs would ideally understand everything there is to know about our codebase, interpret any 
instructions, and answer any question with the full context of the project. After all, context is king, especially with LLMs. But most real-world projects are too large and complex for that to be even remotely feasible.</p><p>A typical enterprise application might contain more than 200,000 lines of code across 1,500 files. That&#8217;s roughly 800,000 tokens before documentation, build configuration, or tests. Even a million-token window can&#8217;t hold it all. But capacity isn&#8217;t the real constraint. Models start degrading long before they hit the limit.</p><p>Research from Stanford and Berkeley identified what they call the <a href="https://arxiv.org/abs/2307.03172">lost-in-the-middle phenomenon</a>: LLMs perform best when relevant information sits at the beginning or end of the input context, with significant degradation when models must access information buried in the middle. Performance follows a U-shaped curve, favoring primacy and recency while struggling with everything between.</p><p>Practitioners have noticed this gap between advertised capacity and effective use. 
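</p><p>Given that U-shaped attention curve, one mitigation is to order retrieved context so the most relevant pieces sit at the edges of the prompt rather than the middle. A minimal sketch of the idea; the 4-characters-per-token estimate is a rough rule of thumb, and the relevance scores are assumed to be supplied by the caller, not by any particular API:</p>

```python
def pack_context(snippets, budget_tokens, est=lambda s: len(s) // 4):
    """Select snippets within a token budget, then order them so the most
    relevant sit at the start and end of the prompt, where attention is strongest.

    `snippets` is a list of (relevance, text) pairs; scoring is the caller's job.
    `est` is a rough chars/4 token estimate; swap in a real tokenizer if you have one.
    """
    ranked = sorted(snippets, key=lambda s: s[0], reverse=True)
    kept, used = [], 0
    for _, text in ranked:
        cost = est(text)
        if used + cost > budget_tokens:
            break
        kept.append(text)
        used += cost
    # Alternate the best snippets between front and back; the weakest land in the middle
    front, back = [], []
    for i, text in enumerate(kept):
        (front if i % 2 == 0 else back).append(text)
    return "\n\n".join(front + back[::-1])
```

<p>Note that the budget passed in should sit well below the advertised window, since models degrade long before the hard limit.</p><p>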
Dex Horthy at HumanLayer popularized the term &#8220;dumb zone&#8221; in his presentation <a href="https://www.youtube.com/watch?v=rmvDxxNubIg">&#8220;No Vibes Allowed: Solving Hard Problems in Complex Codebases&#8221;</a> to describe the region where LLM reasoning degrades as context usage grows past 40-60% of maximum capacity. A model claiming 200K tokens becomes unreliable around 130K. The threshold isn&#8217;t a gradual decline. It&#8217;s a cliff.</p><p>We can&#8217;t just feed in context and hope for miracles. Every token competes for the model&#8217;s limited attention budget. What we include matters. When we include it matters. The order in which we present information matters. Context engineering becomes as important as the prompts themselves.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x-xB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F950b87b8-a986-4ff0-a33e-6eadacb68668_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x-xB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F950b87b8-a986-4ff0-a33e-6eadacb68668_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!x-xB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F950b87b8-a986-4ff0-a33e-6eadacb68668_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!x-xB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F950b87b8-a986-4ff0-a33e-6eadacb68668_1536x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!x-xB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F950b87b8-a986-4ff0-a33e-6eadacb68668_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x-xB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F950b87b8-a986-4ff0-a33e-6eadacb68668_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/950b87b8-a986-4ff0-a33e-6eadacb68668_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;content_1.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="content_1.png" title="content_1.png" srcset="https://substackcdn.com/image/fetch/$s_!x-xB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F950b87b8-a986-4ff0-a33e-6eadacb68668_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!x-xB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F950b87b8-a986-4ff0-a33e-6eadacb68668_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!x-xB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F950b87b8-a986-4ff0-a33e-6eadacb68668_1536x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!x-xB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F950b87b8-a986-4ff0-a33e-6eadacb68668_1536x1024.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>Three Patterns, One Problem</strong></h2><p>What makes AI-assisted development challenging isn&#8217;t any single limitation. It&#8217;s the interaction of three related patterns that compound as work complexity increases. 
Understanding these patterns is essential for knowing when to push forward and when to start fresh.</p><p>The first is <strong>context rot</strong>. This is about volume and position, not task difficulty. As you add more tokens to the context window, the model&#8217;s ability to find and use relevant information degrades. The culprit is attention: every token competes with every other token. Add 10,000 tokens of background code, and the 500 tokens that actually matter get harder to locate.</p><p><a href="https://research.trychroma.com/context-rot">Chroma Research</a> documented this effect systematically. Information buried in the middle of long contexts gets missed, even when it&#8217;s exactly what the model needs. Related-but-irrelevant content is worse than random noise because it looks relevant enough to distract. The practical result: you can make a model perform worse by giving it more context, even when that context is accurate and related to the task.</p><p>The second, and often most damaging in practice, is <strong>trajectory poisoning</strong>. When an AI coding session goes wrong, the errors don&#8217;t just disappear. They become part of the context the model uses for subsequent reasoning. The model references its own mistakes. Attempts to correct them introduce more confusion. Failed approaches, abandoned code paths, and accumulated misunderstandings pollute the working context until the model gets stuck in patterns it can&#8217;t escape.</p><p>There is no course correction that can fix the context. Once a trajectory goes bad, the context is poisoned. Every subsequent interaction draws on that polluted history. The model confidently builds on flawed foundations, producing output that looks reasonable but compounds the original errors.</p><p>The third is the <strong>reasoning cliff</strong>. This is about task complexity, not context size. 
As problems require more reasoning steps or involve more interacting variables, model accuracy doesn&#8217;t decline gradually. It collapses.</p><p><a href="https://machinelearning.apple.com/research/gsm-symbolic">Apple&#8217;s GSM-Symbolic research</a> documented this threshold systematically. Adding a single clause that seems relevant to a math problem caused performance drops of up to 65% across all state-of-the-art models, even when that clause contributed nothing to the solution. The models aren&#8217;t reasoning through the problem. They&#8217;re pattern-matching, and extra complexity breaks the pattern. The practical result: you can&#8217;t fix a task that&#8217;s past the cliff by prompting better. You have to simplify the task itself.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fW90!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250a9e25-c96d-44ca-a3fa-cde4245a6884_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fW90!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250a9e25-c96d-44ca-a3fa-cde4245a6884_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!fW90!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250a9e25-c96d-44ca-a3fa-cde4245a6884_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!fW90!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250a9e25-c96d-44ca-a3fa-cde4245a6884_1536x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!fW90!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250a9e25-c96d-44ca-a3fa-cde4245a6884_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fW90!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250a9e25-c96d-44ca-a3fa-cde4245a6884_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/250a9e25-c96d-44ca-a3fa-cde4245a6884_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1282555,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://read.invoke.dev/i/185102624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250a9e25-c96d-44ca-a3fa-cde4245a6884_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fW90!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250a9e25-c96d-44ca-a3fa-cde4245a6884_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!fW90!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250a9e25-c96d-44ca-a3fa-cde4245a6884_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!fW90!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250a9e25-c96d-44ca-a3fa-cde4245a6884_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!fW90!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250a9e25-c96d-44ca-a3fa-cde4245a6884_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>These patterns interact. Context rot makes trajectory poisoning more likely. 
Poisoned trajectories push tasks past the reasoning cliff. The compound effect explains why AI-assisted development can feel so inconsistent: sometimes brilliantly helpful, sometimes stubbornly wrong in ways that resist correction.</p><h3><strong>Diminishing Returns</strong></h3><p>These phenomena share a common root. Transformer attention scales quadratically with sequence length. Every token must be compared with every other token, so doubling context length quadruples computational cost. But the problem isn&#8217;t just compute. The model&#8217;s ability to maintain meaningful relationships across vast token distances degrades even when hardware handles the load.</p><p>Apple&#8217;s research revealed something counterintuitive: when problems become very difficult, models actually expend less reasoning effort. Their internal heuristics signal diminishing returns, so they give up rather than trying harder. <a href="https://towardsdatascience.com/your-1m-context-window-llm-is-less-powerful-than-you-think/">Stanford researchers</a> found similar limits on working memory. 
LLMs can track at most 5-10 variables before performance degrades toward random guessing.</p><p>Don&#8217;t assume that because an answer is computable from the context, an LLM can compute it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b0kF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0ae0de-9323-45ed-8d52-332910551ba7_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b0kF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0ae0de-9323-45ed-8d52-332910551ba7_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!b0kF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0ae0de-9323-45ed-8d52-332910551ba7_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!b0kF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0ae0de-9323-45ed-8d52-332910551ba7_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!b0kF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0ae0de-9323-45ed-8d52-332910551ba7_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b0kF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0ae0de-9323-45ed-8d52-332910551ba7_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b0ae0de-9323-45ed-8d52-332910551ba7_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;content_5.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="content_5.png" title="content_5.png" srcset="https://substackcdn.com/image/fetch/$s_!b0kF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0ae0de-9323-45ed-8d52-332910551ba7_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!b0kF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0ae0de-9323-45ed-8d52-332910551ba7_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!b0kF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0ae0de-9323-45ed-8d52-332910551ba7_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!b0kF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0ae0de-9323-45ed-8d52-332910551ba7_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>Managing Complexity</strong></h2><p>The temptation when hitting these limits is to wait for better models. Larger context windows. More sophisticated reasoning. The tools improve constantly. This approach misses the point.</p><p>Addy Osmani captures it well in his <a href="https://addyo.substack.com/p/my-llm-coding-workflow-going-into">LLM coding workflow guide</a>: scope management is everything. Feed the LLM manageable tasks, not the whole codebase at once. Break projects into iterative steps and tackle them one by one. Each chunk should be small enough that the AI can handle it effectively and you can understand the code it produces.</p><p>This mirrors good software engineering practice, but it&#8217;s even more important with AI in the loop. LLMs do best with focused prompts: implement one function, fix one bug, add one feature at a time. 
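</p><p>That discipline can be made mechanical: give each narrow task its own fresh conversation instead of piling everything into one. A minimal sketch, where `new_session` and the task list are illustrative stand-ins for whatever LLM client you actually use:</p>

```python
def new_session():
    """Stand-in for opening a fresh LLM conversation with empty history."""
    history = []

    def ask(prompt: str) -> str:
        history.append(prompt)  # a real client would call the model here
        return f"[response after {len(history)} turn(s)]"

    return ask


# One focused task per session: no accumulated noise from earlier work
tasks = [
    "Implement the email-format validation helper.",
    "Fix the off-by-one bug in pagination.",
    "Add unit tests for the validation helper.",
]

for task in tasks:
    ask = new_session()  # clean context for each task
    draft = ask(task)
    # review the draft, run the tests, and commit before the next task
```

<p>Each session stays far from the degradation zone because it only ever holds one task&#8217;s worth of context.</p><p>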
If you ask for too much at once, the model is likely to produce low-quality results.</p><p><strong>Keep Tasks Small and Focused</strong></p><p>Keep the problem scope narrow enough that you can hold the entire context in your head. If you can&#8217;t explain what the AI should do in a sentence or two, the task is too big. Split it.</p><p>Provide only the context the AI needs for the current task. Dumping your entire codebase into context doesn&#8217;t help. It actively hurts by filling the effective working space with noise.</p><p><strong>Use Subagents to Scope Context, Not Define Roles</strong></p><p>Agents aren&#8217;t just about role-playing different personas. Their real value lies in constraining which context is loaded at each step of work.</p><p>A research agent pulls documentation and API references. An implementation skill loads only the module being modified. Testing skills focus on test files and the code under test. <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">Anthropic&#8217;s context engineering guidance</a> explains this pattern: rather than having a single agent maintain state across an entire project, specialized sub-agents handle focused tasks with clean context windows.</p><p>The subagent pattern works because each step operates with focused context rather than the accumulated noise of everything that came before. 
Think of skills and subagents as context boundaries, not personality changes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MjWm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229b35dd-8291-4380-a3af-5ee92b8b87c4_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MjWm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229b35dd-8291-4380-a3af-5ee92b8b87c4_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!MjWm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229b35dd-8291-4380-a3af-5ee92b8b87c4_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!MjWm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229b35dd-8291-4380-a3af-5ee92b8b87c4_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!MjWm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229b35dd-8291-4380-a3af-5ee92b8b87c4_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MjWm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229b35dd-8291-4380-a3af-5ee92b8b87c4_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/229b35dd-8291-4380-a3af-5ee92b8b87c4_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;content_6.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="content_6.png" title="content_6.png" srcset="https://substackcdn.com/image/fetch/$s_!MjWm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229b35dd-8291-4380-a3af-5ee92b8b87c4_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!MjWm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229b35dd-8291-4380-a3af-5ee92b8b87c4_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!MjWm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229b35dd-8291-4380-a3af-5ee92b8b87c4_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!MjWm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F229b35dd-8291-4380-a3af-5ee92b8b87c4_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Clear Context Between Tasks</strong></p><p>One of the most practical lessons from production AI coding is to start fresh when output quality degrades. The <strong>&#8220;poisoned trajectory&#8221;</strong> problem is real.</p><p>Once an AI session goes wrong, the errors become part of the context. The model references its own mistakes. Attempts to correct them introduce more noise. The context fills with false starts and abandoned approaches.</p><p>The fix is simple. Commit your work, close the session, and start a new one with a clean context. Treat sessions like git branches. When one goes sideways, abandon it rather than trying to salvage the accumulated mess.</p><p>Session hygiene matters more than most practitioners realize. 
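</p><p>The mechanics are plain git. A sketch of the workflow, run here inside a scratch repository so it&#8217;s self-contained; in practice you&#8217;d do this in your working repo:</p>

```shell
set -e
# Scratch repo for demonstration; use your real repo in practice
repo=$(mktemp -d) && cd "$repo" && git init -q .
git config user.email dev@example.com && git config user.name dev

# Checkpoint known-good state before starting an AI session
echo "working code" > app.py
git add -A && git commit -qm "checkpoint: before AI session"

# Do the AI-assisted work on a throwaway branch
git switch -q -c ai-attempt-1
echo "questionable AI edits" >> app.py
git add -A && git commit -qm "wip: AI attempt"

# Session went sideways: abandon the branch instead of salvaging it
git switch -q -
git branch -qD ai-attempt-1
# A fresh session can now start from the clean checkpoint
```

<p>Committing before each session makes starting over cheap: abandoning a poisoned trajectory costs one branch delete, not an afternoon of arguing with the model.</p><p>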
The teams getting consistent value from AI tools have developed strong instincts about when to start over rather than push through degraded output.</p><h2><strong>Engineering Judgment Still Required</strong></h2><p>Waiting for the models to fix the problem is a losing strategy. The alternative is treating scope management as a core skill. Decomposing complex problems into focused tasks isn&#8217;t a workaround for tool limitations. It&#8217;s the practice that makes AI assistance genuinely useful.</p><p>This shouldn&#8217;t be surprising. Breaking complex problems into manageable pieces was always good engineering practice. We just didn&#8217;t notice how much we relied on human ability to infer meaning from an ambiguous context. Humans read a vague ticket, pull in relevant knowledge from memory, ask clarifying questions, and still produce coherent work. LLMs can&#8217;t.</p><p>Adherence to a structured SDLC that was optional in a pre-AI world becomes essential when your implementation partner takes instructions literally and forgets everything between sessions. Processes like <a href="https://github.com/github/spec-kit/blob/main/spec-driven.md">Spec-Driven Development</a> or <a href="https://deepwiki.com/humanlayer/advanced-context-engineering-for-coding-agents">Research Plan Implementation (RPI)</a> are tailored for working with coding agents. Each phase benefits from AI assistance. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jxX1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1074d25f-8502-4545-a424-0a12f9a01ed2_1536x1024.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jxX1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1074d25f-8502-4545-a424-0a12f9a01ed2_1536x1024.heic 424w, https://substackcdn.com/image/fetch/$s_!jxX1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1074d25f-8502-4545-a424-0a12f9a01ed2_1536x1024.heic 848w, https://substackcdn.com/image/fetch/$s_!jxX1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1074d25f-8502-4545-a424-0a12f9a01ed2_1536x1024.heic 1272w, https://substackcdn.com/image/fetch/$s_!jxX1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1074d25f-8502-4545-a424-0a12f9a01ed2_1536x1024.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jxX1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1074d25f-8502-4545-a424-0a12f9a01ed2_1536x1024.heic" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1074d25f-8502-4545-a424-0a12f9a01ed2_1536x1024.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:89525,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://read.invoke.dev/i/185102624?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1074d25f-8502-4545-a424-0a12f9a01ed2_1536x1024.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jxX1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1074d25f-8502-4545-a424-0a12f9a01ed2_1536x1024.heic 424w, https://substackcdn.com/image/fetch/$s_!jxX1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1074d25f-8502-4545-a424-0a12f9a01ed2_1536x1024.heic 848w, https://substackcdn.com/image/fetch/$s_!jxX1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1074d25f-8502-4545-a424-0a12f9a01ed2_1536x1024.heic 1272w, https://substackcdn.com/image/fetch/$s_!jxX1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1074d25f-8502-4545-a424-0a12f9a01ed2_1536x1024.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We can&#8217;t completely outsource engineering to LLMs. Keeping humans in the loop is critical. Humans are still far more efficient at inferring meaning, managing context, and interpreting ambiguity. It is human ingenuity that is necessary to design the systems, make judgment calls, and decompose problems into chunks that fit within the effective working zone. That work can&#8217;t be automated.</p><p>The tools improve constantly. The constraints I&#8217;ve described will shift. But the fundamental concepts hold. AI amplifies engineering capability rather than replacing engineering judgment. The complexity cliff exists. 
Working deliberately within their boundaries is how you get value from these tools without drowning in confident-sounding mistakes.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.invoke.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Invoke.dev! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[AI Reality Check]]></title><description><![CDATA[The Forces Challenging the Future of AI]]></description><link>https://read.invoke.dev/p/ai-reality-check</link><guid isPermaLink="false">https://read.invoke.dev/p/ai-reality-check</guid><dc:creator><![CDATA[Alex Robinson]]></dc:creator><pubDate>Tue, 13 Jan 2026 23:37:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/162ef151-78be-4b8c-a855-f33657ec1d4b_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We&#8217;ve been operating under the assumption that AI success is inevitable. Scale the models, add more compute, and wait for the next generation; breakthrough capabilities will emerge. Companies are betting billions on AI&#8217;s success.</p><p>The pressure on companies to adopt AI or be left behind has been overwhelming. But what if the premise is wrong? 
What if the current approach faces fundamental constraints that more money and more computing power can&#8217;t solve?</p><h1>The Converging Headwinds</h1><p>Recently, the tone of the conversation has shifted, with AI now being referred to as a bubble. We&#8217;re hitting structural limits, and it seems the technical architecture has plateaued. These challenges aren&#8217;t isolated technical problems. The following are a few of the headwinds that AI must overcome to succeed in the long term.</p><p><strong>The economics don&#8217;t work.</strong></p><p>OpenAI projects cumulative <a href="https://www.cnbc.com/2025/09/06/openai-business-to-burn-115-billion-through-2029-the-information.html">losses of $115 billion through 2029</a>. Anthropic burned $2 billion in 2024. The entire industry loses money on every inference request. Companies claim they&#8217;ll reach profitability through scale, but <a href="https://aimagazine.com/news/mit-why-95-of-enterprise-ai-investments-fail-to-deliver">95% of AI pilots fail to deliver returns</a>.</p><p>Currently, AI is heavily subsidized, but those subsidies will start to taper off soon. The path to profitability requires price increases that users might not accept. That&#8217;s a fundamental business problem.</p><p><strong>Scaling has plateaued.</strong></p><p>No model feels significantly smarter than GPT-4 from March 2023. OpenAI&#8217;s recent models delivered improvements far less noticeable than the leap from GPT-3 to GPT-4. 
Ilya Sutskever, who pioneered scaling laws, now says <a href="https://www.reuters.com/technology/artificial-intelligence/openai-rivals-seek-new-path-smarter-ai-current-methods-hit-limitations-2024-11-11/">&#8220;results from scaling up pre-training have plateaued.&#8221;</a> A survey of AI researchers found <a href="https://www.livescience.com/technology/artificial-intelligence/current-ai-models-a-dead-end-for-human-level-intelligence-expert-survey-claims">76% say scaling current approaches won&#8217;t reach AGI</a>.</p><p>Industry veterans are acknowledging what the benchmarks show: we&#8217;ve hit diminishing returns. More compute and larger models aren&#8217;t delivering breakthrough capabilities anymore.</p><p><strong>The training data is exhausted.</strong></p><p>According to Elon Musk, <a href="https://techcrunch.com/2025/01/08/elon-musk-agrees-that-weve-exhausted-ai-training-data/">&#8220;We&#8217;ve now exhausted basically the cumulative sum of human knowledge in AI training.&#8221;</a> The industry&#8217;s response is synthetic data, but that is likely to trigger <a href="https://www.ibm.com/think/topics/model-collapse">model collapse</a>: train on AI-generated content recursively, and models produce gibberish after nine iterations.</p><p>Meanwhile, <a href="https://ahrefs.com/blog/what-percentage-of-new-content-is-ai-generated/">74% of new webpages contain AI-generated text</a>. The internet is filling with synthetic content, poisoning the well for future training. It&#8217;s a degenerative loop with no clear solution.</p><p><strong>Datacenter infrastructure can&#8217;t keep up.</strong></p><p>By 2027, <a href="https://www.gartner.com/en/newsroom/press-releases/2024-11-12-gartner-predicts-power-shortages-will-restrict-40-percent-of-ai-data-centers-by-20270">Gartner predicts 40% of AI data centers will hit power shortages</a>. 
Electricity demand is climbing to <a href="https://www.gartner.com/en/newsroom/press-releases/2024-11-12-gartner-predicts-power-shortages-will-restrict-40-percent-of-ai-data-centers-by-20270">500 terawatt-hours annually, 2.6 times what we used in 2023</a>. Utilities need <a href="https://www.datacenterknowledge.com/energy-power-supply/power-shortages-will-restrict-40-of-ai-data-centers-by-2027-gartner">five years or more to add new capacity</a>. Hyperscalers can&#8217;t wait that long, so they&#8217;re cutting deals directly with power providers.</p><p>Meanwhile, residents near data centers pay <a href="https://www.pewresearch.org/short-reads/2025/10/24/what-we-know-about-energy-use-at-us-data-centers-amid-the-ai-boom/">$16-18 more per month on their electric bills</a>. The US faces an <a href="https://www.visualcapitalist.com/shortage-of-u-s-data-center-capacity-2023-2028p/">11-gigawatt capacity shortfall this year, growing to 10 gigawatts by 2028</a>. Companies scouting data center sites don&#8217;t struggle to find land. They struggle to find grid connections and communities willing to accept them.</p><p>Water makes things worse. <a href="https://watercalculator.org/footprint/data-centers-water-use/">US data centers consumed 174 billion gallons in 2020</a>. <a href="https://www.lincolninst.edu/publications/land-lines-magazine/articles/land-water-impacts-data-centers/">Texas alone will hit 399 billion gallons by 2030</a>. <a href="https://www.eesi.org/articles/view/data-centers-and-water-consumption">Large facilities evaporate 5 million gallons daily</a>, enough for a city of 50,000 people. 
<a href="https://www.bloomberg.com/graphics/2025-ai-impacts-data-centers-water-data/">Two-thirds of new data centers built since 2022 sit in regions already facing water stress</a>, putting them in direct competition with residential and agricultural users.</p><p><strong>Unproven ROI.</strong></p><p><a href="https://www.nttdata.com/global/en/insights/focus/2024/between-70-85p-of-genai-deployment-efforts-are-failing">Between 70-85% of AI initiatives fail to meet expected outcomes</a>, according to MIT and RAND Corporation research. <a href="https://workos.com/blog/why-most-enterprise-ai-projects-fail-patterns-that-work">Companies abandoned 42% of their AI projects in 2025</a>, up from just 17% in 2024, per S&amp;P Global Market Intelligence&#8217;s survey of over 1,000 enterprises. The average organization cancels 46% of AI proofs of concept before production.</p><p><a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">McKinsey&#8217;s 2025 State of AI survey</a> found that only 6% of organizations qualify as &#8220;AI high performers,&#8221; generating an EBIT impact of 5% or more from their AI use. Just 39% report any enterprise-level EBIT impact at all. There&#8217;s a fundamental question about whether these tools can deliver ROI at scale.</p><p><strong>Hallucinations and AI slop are eroding trust.</strong></p><p>Developer trust in AI tools dropped from 43% to 33% while active distrust climbed to 46%, according to <a href="https://survey.stackoverflow.co/2025/ai">Stack Overflow&#8217;s 2025 survey</a>. The decline stems from the effort required to review and rework AI-generated code to address hallucinations, security issues, and quality problems. AI generates code faster than humans can review it.</p><p><a href="https://www.faros.ai/blog/ai-software-engineering">Research by Faros AI</a> found that code review times increased by 91% and pull request sizes grew by 154% with AI adoption. 
<a href="https://leaddev.com/technical-direction/how-ai-generated-code-accelerates-technical-debt">GitClear&#8217;s analysis</a> of 211 million lines of code revealed 10x more duplication and a doubling of code churn, driving a <a href="https://www.askflux.ai/blog/256-billion-reasons-to-use-an-ai-code-analysis-platform">41% increase in bugs</a> and long-term maintenance costs.</p><p><strong>Legal exposure and regulatory requirements are mounting.</strong></p><p>A number of copyright lawsuits target AI companies for training on protected works without permission. In February 2025, Thomson Reuters won the first major US ruling <a href="https://www.wired.com/story/thomson-reuters-ai-copyright-lawsuit/">against Ross Intelligence</a>. The court found AI training harmed the market for original content and failed the transformative use threshold under the fair use doctrine. Additional lawsuits claim ChatGPT contributed to suicides, testing whether AI companies face product liability for harmful outputs.</p><p><a href="https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai">The EU AI Act</a> requires full compliance by August 2026, with fines reaching 7% of global annual turnover. Like GDPR, it applies extraterritorially to any company serving EU users. Organizations must classify AI systems by risk level, maintain decision-making documentation, implement human oversight mechanisms, and register high-risk systems in an EU database. Legal costs and regulatory compliance will rise regardless of technical progress.</p><h1>What This Means for Engineering Leaders</h1><p>These headwinds aren&#8217;t future concerns. They&#8217;re affecting decisions now.</p><p>The smart play is healthy skepticism. When vendors promise transformative capabilities, demand proof tied to specific business outcomes. When projects require significant investment, build exit criteria based on concrete metrics, not faith in future improvements. 
When teams want to adopt AI for productivity, calculate the actual costs, including compute, licensing, and the productivity tax of dealing with hallucinations.</p><p>The more important question is strategic positioning. If the current AI approach faces fundamental limits, what should we build that doesn&#8217;t depend on exponential improvements in capability? Where can we create value with current, plateau-level capabilities? What would our architecture look like if AGI never arrives or delivers only incremental gains?</p><h1>Where This Leaves Us</h1><p>I&#8217;m not arguing that AI has no value. Current tools deliver real productivity gains in specific contexts. I&#8217;m arguing that betting your engineering strategy on continued exponential improvement is increasingly risky.</p><p>The evidence suggests we&#8217;re hitting fundamental constraints&#8212;economic, technical, environmental, and social. Companies that recognize this reality and plan accordingly will be better positioned than those waiting for breakthroughs that may never come.</p><p>What patterns are you seeing in your organizations? Are the promised gains materializing, or are you seeing the same quiet disappointment?</p>]]></content:encoded></item><item><title><![CDATA[Pragmatic AI Adoption]]></title><description><![CDATA[Avoid the Slop and Amplify Innovation]]></description><link>https://read.invoke.dev/p/pragmatic-ai-adoption</link><guid isPermaLink="false">https://read.invoke.dev/p/pragmatic-ai-adoption</guid><dc:creator><![CDATA[Alex Robinson]]></dc:creator><pubDate>Tue, 06 Jan 2026 18:30:40 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/08891615-9e82-4568-b74d-e42ce6cccdd4_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Software engineering organizations have made AI adoption a priority. 
After over a year working with these tools, I&#8217;ve settled on a pragmatic approach focused on what we know about software engineering and developer experience.</p><p>The promises haven&#8217;t materialized. <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/">Recent research</a> found experienced developers took 19% longer to complete tasks when using AI tools, despite believing they were 20% faster. Trust is declining as well, even as the models improve. According to a recent <a href="https://survey.stackoverflow.co/2025/">Stack Overflow Developer Survey</a>, only 33% of developers trust AI accuracy, while 46% actively distrust it.</p><p>Let&#8217;s be clear about what we&#8217;re working with. LLMs are <a href="https://en.wikipedia.org/wiki/Stochastic_parrot">stochastic parrots</a> that generate tokens based on probability distributions. They don&#8217;t reason, learn from mistakes, or remember what failed minutes ago. As scope and complexity increase, output correctness decreases. This isn&#8217;t a bug. It&#8217;s a fundamental limitation. This is why it is important to prioritize the human-in-the-loop when adopting AI.</p><p><strong>What AI Does Well:</strong></p><ul><li><p>Pattern matching across vast training data</p></li><li><p>Generating boilerplate and common code structures</p></li><li><p>Quick context synthesis from large text volumes</p></li><li><p>Identifying potential issues based on common patterns</p></li><li><p>Providing near-instant feedback on common mistakes</p></li></ul><p><strong>What Humans Do Well:</strong></p><ul><li><p>Understanding business context and user needs</p></li><li><p>Making judgment calls on trade-offs</p></li><li><p>Learning from mistakes and adapting approaches</p></li><li><p>Reasoning about novel problems from limited data</p></li><li><p>Collaborating across disciplines to clarify requirements</p></li></ul><p><strong>Where AI Adds Value</strong></p><p>These differences matter. 
An organization&#8217;s ability to innovate is directly proportional to its long-term value. You can&#8217;t automate away the work that drives competitive advantage. The real value lies in quality-of-life improvements when AI supports rather than replaces developer roles. Here are five ways to adopt AI to augment rather than replace developers. Successful adoption requires proper configuration and realistic expectations.</p><p><strong>Quality Gates</strong></p><p>One of the easiest ways to introduce AI into the software development lifecycle is to have it review code changes in Pull Requests/Merge Requests. AI acts as a fuzzy logic linter. It won&#8217;t eliminate the need for human review, but it can catch common mistakes and suggest improvements that traditional linters cannot.</p><p>This reduces the burden on senior developers who perform code reviews and provides faster feedback to code authors, allowing them to address issues more quickly. However, AI tools can produce more noise than signal if not correctly configured.</p><p><strong>Code Assistant</strong></p><p>When integrated into an IDE, AI enhances refactoring, autocomplete, and quick-fix tools by leveraging patterns from training data. It can also be used to explain complex code when documentation is missing.</p><p>Copying and pasting code from a chat interface or working from a terminal CLI only creates more friction. While articulating a problem can be a valuable thinking tool, a chat interface is the wrong UI for most programming tasks. The AI coding CLIs are amazing in what they can do, but they make it difficult for developers to see the full context of the software, thereby relinquishing control to the AI. The AI code assistant should be nearly invisible, keeping the developer in the flow. So invest in an IDE or text editor that has native AI integration.</p><p><strong>Analysis</strong></p><p>The source of many software bugs is a misunderstanding of requirements. 
AI can help by asking clarifying questions, rephrasing business requirements in engineering terms, and translating needs into technical designs.</p><p>This leads to clearer shared understanding between engineering and product, more thoroughly considered designs, and faster identification of requirement gaps before implementation. At the same time, AI can&#8217;t make decisions about technical trade-offs or replace human-to-human collaboration.</p><p><strong>Debugging</strong></p><p>AI assists in analyzing crash logs or error logs to find root causes. The most valuable implementations don&#8217;t just read logs in isolation. They correlate signals across the system, linking runtime behavior, performance counters, recent code changes, and deployment history to identify patterns humans might miss.</p><p>This accelerates troubleshooting and helps generate root-cause analysis documentation, preventing similar issues in the future.</p><p><strong>Research</strong></p><p>Whether designing a feature, developing a strategy proposal, or fixing a bug, web search tasks can be outsourced to AI. AI can simplify research by quickly searching across multiple sources and summarizing findings. This can be useful for exploring both divergent and convergent ideas to weigh trade-offs and identify gaps.</p><p>With a good prompt or MCP server, AI can be constrained to search only trusted sources and official documentation, reducing the likelihood of hallucinations. But it is still important to think critically about any finding to avoid losing time going down the wrong path.</p><p><strong>Setting Realistic Expectations</strong></p><p>The current LLM architecture can&#8217;t replace human ingenuity or drastically improve developer velocity when working on complex software. But it can make daily work less tedious. That&#8217;s valuable, even if it&#8217;s not the transformation vendors promised.</p><p>Be realistic about AI&#8217;s impact. 
Research shows automation often shifts toil rather than eliminating it. Time saved in writing code often reappears later in increased review time, rework, or bug fixing. So instead of positioning AI as a productivity improvement, position it as a developer experience improvement to amplify what makes great engineers valuable: judgment, creativity, and reasoning through complex problems.</p><p>The risk of poor AI implementation isn&#8217;t just immediate productivity loss. It&#8217;s long-term skill erosion. Over-automation creates engineers who can&#8217;t diagnose problems because they never learned the fundamentals. Junior engineers need to make mistakes, debug their own code, and understand system behavior from first principles. AI should guide this learning, not replace it.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.invoke.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Invoke.dev! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Simplify Your iOS CI with Makefiles]]></title><description><![CDATA[Harness the power of the classic MAKE utility for efficient and portable iOS builds]]></description><link>https://read.invoke.dev/p/simplify-your-ios-ci-with-makefiles</link><guid isPermaLink="false">https://read.invoke.dev/p/simplify-your-ios-ci-with-makefiles</guid><dc:creator><![CDATA[Alex Robinson]]></dc:creator><pubDate>Tue, 01 Oct 2024 10:03:27 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ba23eb3a-15b6-4234-a398-0e9de729210e_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>An iOS project of any size can benefit from automating the build process. Before reaching for a more complex solution like Fastlane or GitHub Actions, consider starting with the humble MAKE utility. MAKE has been around since 1976 and is preinstalled on any UNIX-like OS, including macOS, making it easy to adopt.</p><h2>The Elegant Simplicity of MAKE</h2><p>MAKE is the inspiration for many modern build tools, including Fastlane, and is still an excellent option for projects that need automation but don't want to invest in complicated build systems. Here are a few benefits of using Makefiles over some alternatives.</p><p>1. <strong>Simplicity: </strong>Makefiles use a straightforward syntax that's easy to read and write.</p><p>2. <strong>Portability:</strong> Since MAKE is preinstalled on most UNIX-like systems, there are no dependencies to install and maintain on CI servers or developer machines. </p><p>3. 
<strong>Consistency:</strong> They provide a single source of truth for build and test commands.</p><p>4. <strong>Version Control:</strong> Your build process is now versioned alongside your code.</p><p>Let's look at how you can implement a Makefile for your iOS CI process.</p><h2>Implementing a Makefile for iOS CI</h2><p>Here's a basic structure for your iOS project's Makefile:</p><pre><code>PROJECT = $(shell ls -1 . | grep .xcodeproj)
SCHEME = ExampleProject
BUILD_DIR = .build

.DEFAULT_GOAL := help

help:
  @echo "Goals:"
  @echo "help - display this message"
  @echo "clean - clean the project"
  @echo "build - build the project"
  @echo "test - test the project"

clean:
  @echo Cleaning...
  xcodebuild clean \
    -project $(PROJECT) \
    -scheme $(SCHEME) \
    | xcbeautify
  @rm -rf $(BUILD_DIR)

build: clean
  @echo Building...
  xcodebuild build \
    -project $(PROJECT) \
    -scheme $(SCHEME) \
    -derivedDataPath $(BUILD_DIR) \
    | xcbeautify

test: clean
  @echo Testing...
  xcodebuild test \
    -project $(PROJECT) \
    -scheme $(SCHEME) \
    -derivedDataPath $(BUILD_DIR) \
    | xcbeautify

.PHONY: help clean build test</code></pre><h2>Breaking Down the Makefile</h2><p>This Makefile defines several goals: help, clean, build, and test. Note that recipe lines in a real Makefile must be indented with a tab character, not spaces. Let's examine each goal:</p><p>1. <strong>help:</strong> A convenient goal that prints the available goals.</p><p>2. <strong>clean:</strong> Cleans the build folder.</p><p>3. <strong>build:</strong> Builds the project using xcodebuild.</p><p>4. <strong>test:</strong> Runs the test suite.</p><p>This is a simple example Makefile that can easily be extended to include goals for deploying to TestFlight or running SwiftLint. </p><h3>Define Prerequisites</h3><p>Prerequisites of a goal are listed after the colon, ensuring that it executes the prereqs before executing the specified goal.</p><pre><code>build: clean</code></pre><h3>Setting a Default</h3><p>By convention, running <code>make</code> without additional parameters will execute the first goal in the file. As of MAKE 3.80, the default can be set explicitly using the <code>.DEFAULT_GOAL</code> variable, making it a bit clearer.</p><pre><code>.DEFAULT_GOAL := help</code></pre><h3>Variables</h3><p>Variable values can be passed in as command-line arguments to MAKE or as environment variables, adding additional flexibility in configuring a Makefile. </p><pre><code>make build SCHEME=MyScheme</code></pre><h3>Why so .PHONY?</h3><p>By default, Makefile goals are file targets &#8212; they are expected to produce a file with the same name as the goal. If that file already exists, it won't run the goal. Using <code>.PHONY</code> tells MAKE that a goal will not produce a file.</p><h2>GitHub Actions</h2><p>Turn the Makefile into a Continuous Integration process by integrating it with a GitHub Actions workflow. </p><ul><li><p>Keeps the GH Actions workflow simple</p></li><li><p>Developers can use the same commands as CI to execute builds locally</p></li></ul><h2>When MAKE isn't enough</h2><p>While MAKE is a straightforward solution, it can become more complex as a project grows. 
Beyond a certain point, maintaining it yields diminishing returns. To scale a build system for complex projects with many modules, consider alternatives like <a href="https://buck.build">BUCK</a> or <a href="https://bazel.build">BAZEL</a>, which support concepts like remote build execution and remote build caches.</p><h2>Wrapping Up</h2><p>MAKE is a simple yet powerful tool that can bring consistency, efficiency, and clarity to your CI pipeline without hiding details away behind unnecessary abstractions.</p><p>As you integrate this approach into your workflow, consider:</p><ul><li><p>How can you tailor the Makefile to your team's specific needs?</p></li><li><p>What other areas of your development process could benefit from similar standardization?</p></li><li><p>How might this approach evolve as your team and project grow?</p></li></ul><p>Remember, the goal is to spend less time fighting with CI and more time building great iOS apps. Happy coding!</p><h2>Further Reading</h2><ul><li><p><a href="https://arturgruchala.com/lost-art-of-makefile/">The Lost Art of Makefile: A Guide for iOS Developers</a></p></li><li><p><a href="https://makefiletutorial.com">Makefile Tutorial by Example</a></p></li></ul><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.invoke.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Invoke! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Mastering Swift Package Collections]]></title><description><![CDATA[Swift Package Manager is a powerful solution for managing dependencies and creating modular apps.]]></description><link>https://read.invoke.dev/p/mastering-swift-package-collections</link><guid isPermaLink="false">https://read.invoke.dev/p/mastering-swift-package-collections</guid><dc:creator><![CDATA[Alex Robinson]]></dc:creator><pubDate>Tue, 24 Sep 2024 10:00:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f5a4e166-9f97-4d53-b0c0-683a26f66b10_1680x1201.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://www.swift.org/package-manager/">Swift Package Manager</a> is a powerful solution for managing dependencies and creating modular apps. Over the years, Xcode support for SwiftPM has gradually improved, streamlining the development workflow. In Xcode 13, discovering and managing third-party Swift packages became a little easier with the introduction of <a href="https://www.swift.org/blog/package-collections/">Swift Package Collections</a>.</p><p>A Package Collection is a JSON file that describes a curated list of packages that can be shared directly or online. It can also be signed to prevent tampering. Collections can be a great way to share dependencies across an organization or make frequently used packages more accessible.</p><p>Xcode is pre-configured with a single Package Collection published by Apple. 
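</p><p>Beyond the Xcode UI, collections can also be managed from the command line with the <code>swift package-collection</code> subcommands introduced alongside this feature (SE-0291). A brief sketch, where the collection URL is a placeholder rather than a real collection:</p><pre><code># Add a collection (placeholder URL), then inspect what's configured
swift package-collection add https://example.com/my-collection.json
swift package-collection list
swift package-collection describe https://example.com/my-collection.json

# Remove it when no longer needed
swift package-collection remove https://example.com/my-collection.json</code></pre><p>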
Adding an import statement to a file for one of the packages in this collection results in a fix-it prompt to add the dependency to the project. Adding new collections can enable this same fix-it for other Swift Packages.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6PoD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d91c488-5f35-489f-a0b1-690ec009e0ef_1382x206.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6PoD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d91c488-5f35-489f-a0b1-690ec009e0ef_1382x206.webp 424w, https://substackcdn.com/image/fetch/$s_!6PoD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d91c488-5f35-489f-a0b1-690ec009e0ef_1382x206.webp 848w, https://substackcdn.com/image/fetch/$s_!6PoD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d91c488-5f35-489f-a0b1-690ec009e0ef_1382x206.webp 1272w, https://substackcdn.com/image/fetch/$s_!6PoD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d91c488-5f35-489f-a0b1-690ec009e0ef_1382x206.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6PoD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d91c488-5f35-489f-a0b1-690ec009e0ef_1382x206.webp" width="1382" height="206" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9d91c488-5f35-489f-a0b1-690ec009e0ef_1382x206.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:206,&quot;width&quot;:1382,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;fix-it_hude9767f07c068af0d1e300f009ed895d_47039_1382x206_fit_q75_h2_box_3.webp&quot;,&quot;title&quot;:&quot;fix-it_hude9767f07c068af0d1e300f009ed895d_47039_1382x206_fit_q75_h2_box_3.webp&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="fix-it_hude9767f07c068af0d1e300f009ed895d_47039_1382x206_fit_q75_h2_box_3.webp" title="fix-it_hude9767f07c068af0d1e300f009ed895d_47039_1382x206_fit_q75_h2_box_3.webp" srcset="https://substackcdn.com/image/fetch/$s_!6PoD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d91c488-5f35-489f-a0b1-690ec009e0ef_1382x206.webp 424w, https://substackcdn.com/image/fetch/$s_!6PoD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d91c488-5f35-489f-a0b1-690ec009e0ef_1382x206.webp 848w, https://substackcdn.com/image/fetch/$s_!6PoD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d91c488-5f35-489f-a0b1-690ec009e0ef_1382x206.webp 1272w, https://substackcdn.com/image/fetch/$s_!6PoD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d91c488-5f35-489f-a0b1-690ec009e0ef_1382x206.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>The <a 
href="https://swiftpackageindex.com/">Swift Package Index</a> maintained by <a href="https://daveverwer.com/">Dave Verwer</a> and <a href="https://finestructure.co/">Sven A. Schmidt</a> makes it easy to find Swift Packages as well as Package Collections.</p><h3><strong>Add a Package Collection in Xcode</strong></h3><p>Adding a collection, like the <a href="https://www.pointfree.co/">Pointfree</a> open-source packages, to Xcode is straightforward. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TMp-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff198b5dd-2215-43f7-a5fc-51cf6044fc60_960x768.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TMp-!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff198b5dd-2215-43f7-a5fc-51cf6044fc60_960x768.gif 424w, https://substackcdn.com/image/fetch/$s_!TMp-!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff198b5dd-2215-43f7-a5fc-51cf6044fc60_960x768.gif 848w, https://substackcdn.com/image/fetch/$s_!TMp-!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff198b5dd-2215-43f7-a5fc-51cf6044fc60_960x768.gif 1272w, https://substackcdn.com/image/fetch/$s_!TMp-!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff198b5dd-2215-43f7-a5fc-51cf6044fc60_960x768.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TMp-!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff198b5dd-2215-43f7-a5fc-51cf6044fc60_960x768.gif" width="960" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f198b5dd-2215-43f7-a5fc-51cf6044fc60_960x768.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;add-package.gif&quot;,&quot;title&quot;:&quot;add-package.gif&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="add-package.gif" title="add-package.gif" srcset="https://substackcdn.com/image/fetch/$s_!TMp-!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff198b5dd-2215-43f7-a5fc-51cf6044fc60_960x768.gif 424w, https://substackcdn.com/image/fetch/$s_!TMp-!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff198b5dd-2215-43f7-a5fc-51cf6044fc60_960x768.gif 848w, https://substackcdn.com/image/fetch/$s_!TMp-!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff198b5dd-2215-43f7-a5fc-51cf6044fc60_960x768.gif 1272w, https://substackcdn.com/image/fetch/$s_!TMp-!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff198b5dd-2215-43f7-a5fc-51cf6044fc60_960x768.gif 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Steps to add a collection:</p><ol><li><p>Within Xcode, click <em><strong>File &gt; Add Packages</strong>&#8230;</em></p></li><li><p>Under <em><strong>Package Dependencies</strong></em>, click the <em><strong>+</strong></em></p></li><li><p>In the bottom left corner of the package browser, click the <em><strong>+</strong></em> and select <em><strong>Add Package Collection&#8230;</strong></em></p></li><li><p>Paste the URL of the Package Collection</p></li><li><p>Confirm by clicking <em><strong>Add Collection</strong></em></p></li></ol><p>Once added, it is easy for Xcode to find and import your favorite packages. Try importing a module that isn&#8217;t already a project dependency, and Xcode will warn you. 
But if the package is part of a package collection registered with Xcode, then Xcode will provide a fix-it that will automatically add the dependency to the project.</p><h3><strong>Add a Package Collection from the Command Line</strong></h3><p>Using the Pointfree open-source collection of Swift packages as an example, the following command will add the collection to SwiftPM and make it available in Xcode.</p><pre><code><code>$ swift package-collection add https://swiftpackageindex.com/pointfreeco/collection.json</code></code></pre><div><hr></div><h2><strong>Creating Your Own Swift Package Collections</strong></h2><p>Creating new collections can be a convenient way to share common packages used internally within an organization or to group related packages of an open-source project, like <a href="https://swiftpackageindex.com/vapor">Vapor</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o6Pr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf78928-e369-4d0f-be8e-1dbed155dfaf_1193x608.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o6Pr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf78928-e369-4d0f-be8e-1dbed155dfaf_1193x608.webp 424w, https://substackcdn.com/image/fetch/$s_!o6Pr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf78928-e369-4d0f-be8e-1dbed155dfaf_1193x608.webp 848w, https://substackcdn.com/image/fetch/$s_!o6Pr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf78928-e369-4d0f-be8e-1dbed155dfaf_1193x608.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!o6Pr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf78928-e369-4d0f-be8e-1dbed155dfaf_1193x608.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o6Pr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf78928-e369-4d0f-be8e-1dbed155dfaf_1193x608.webp" width="1193" height="608" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2cf78928-e369-4d0f-be8e-1dbed155dfaf_1193x608.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:608,&quot;width&quot;:1193,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;creation-flow_hu0502bb7e9cef67ff370fa861d7425029_22270_1193x608_fit_q75_h2_box_3.webp&quot;,&quot;title&quot;:&quot;creation-flow_hu0502bb7e9cef67ff370fa861d7425029_22270_1193x608_fit_q75_h2_box_3.webp&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="creation-flow_hu0502bb7e9cef67ff370fa861d7425029_22270_1193x608_fit_q75_h2_box_3.webp" title="creation-flow_hu0502bb7e9cef67ff370fa861d7425029_22270_1193x608_fit_q75_h2_box_3.webp" srcset="https://substackcdn.com/image/fetch/$s_!o6Pr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf78928-e369-4d0f-be8e-1dbed155dfaf_1193x608.webp 424w, https://substackcdn.com/image/fetch/$s_!o6Pr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf78928-e369-4d0f-be8e-1dbed155dfaf_1193x608.webp 848w, 
https://substackcdn.com/image/fetch/$s_!o6Pr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf78928-e369-4d0f-be8e-1dbed155dfaf_1193x608.webp 1272w, https://substackcdn.com/image/fetch/$s_!o6Pr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cf78928-e369-4d0f-be8e-1dbed155dfaf_1193x608.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4><strong>Step 1: Define the input of the package collection in a JSON file</strong></h4><p>To get started, create a JSON file 
describing the metadata for the collection and the packages to include. At a minimum, each package needs only a GitHub URL, but you can also specify additional fields to provide more context. Check out the <a href="https://github.com/apple/swift-package-collection-generator/tree/main/Sources/PackageCollectionGenerator">README</a> to learn more about the options.</p><pre><code><code>{
  "name": "Collection Example",
  "overview": "This is a sample package collection.",
  "keywords": [
    "SwiftUI Navigation",
    "Composable Architecture"
  ],
  "packages": [
    {
      "url": "https://github.com/pointfreeco/swift-composable-architecture"
    },
    {
      "url": "https://github.com/pointfreeco/swiftui-navigation"
    }
  ],
  "author": {
    "name": "Jane Doe"
  }
}</code></code></pre><h4><strong>Step 2: Generate the output file with the package-collection-generate command</strong></h4><p>Apple provides the <a href="https://github.com/apple/swift-package-collection-generator">package collection generator</a> to simplify generating the collection by filling in detailed metadata from the repositories, including a complete list of the available versions.</p><p>The package collection generator does not currently ship with the Swift toolchain, so it needs to be built from source.</p><p>Clone the git repository.</p><pre><code><code>$ git clone https://github.com/apple/swift-package-collection-generator.git</code></code></pre><p>Build the generator from inside the cloned directory.</p><pre><code><code>$ cd swift-package-collection-generator
$ swift build --configuration release</code></code></pre><p>Install the commands on your system path.</p><pre><code><code>install .build/release/package-collection-generate /usr/local/bin/package-collection-generate

install .build/release/package-collection-diff /usr/local/bin/package-collection-diff

install .build/release/package-collection-sign /usr/local/bin/package-collection-sign

install .build/release/package-collection-validate /usr/local/bin/package-collection-validate</code></code></pre><p>Now, with our input.json, the generator can fill in additional metadata for each package from GitHub.</p><pre><code><code>$ package-collection-generate input.json generated-output.json</code></code></pre><p>The <strong>generated-output.json</strong> is now a complete description of our package collection.</p><h3><strong>Step 3: Sign Collection (Optional)</strong></h3><p>To prevent tampering with the collection, you can sign it using the <strong>package-collection-sign</strong> command. This command takes the generated JSON file as an input and generates a signed collection file. This step requires generating a signing certificate on the Apple developer portal. Check out the <a href="https://github.com/apple/swift-package-manager/blob/main/Documentation/PackageCollections.md#package-collection-signing-optional">SwiftPM documentation</a> for more details on the signing requirements.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XtxX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15066a4f-3645-4080-bfe4-c268425e4031_1786x1138.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XtxX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15066a4f-3645-4080-bfe4-c268425e4031_1786x1138.webp 424w, https://substackcdn.com/image/fetch/$s_!XtxX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15066a4f-3645-4080-bfe4-c268425e4031_1786x1138.webp 848w, 
https://substackcdn.com/image/fetch/$s_!XtxX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15066a4f-3645-4080-bfe4-c268425e4031_1786x1138.webp 1272w, https://substackcdn.com/image/fetch/$s_!XtxX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15066a4f-3645-4080-bfe4-c268425e4031_1786x1138.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XtxX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15066a4f-3645-4080-bfe4-c268425e4031_1786x1138.webp" width="1456" height="928" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15066a4f-3645-4080-bfe4-c268425e4031_1786x1138.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:928,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;signing-certificate_huba43b682281f5060af1a576cf1011abd_250552_1786x1138_fit_q75_h2_box_3.webp&quot;,&quot;title&quot;:&quot;signing-certificate_huba43b682281f5060af1a576cf1011abd_250552_1786x1138_fit_q75_h2_box_3.webp&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="signing-certificate_huba43b682281f5060af1a576cf1011abd_250552_1786x1138_fit_q75_h2_box_3.webp" title="signing-certificate_huba43b682281f5060af1a576cf1011abd_250552_1786x1138_fit_q75_h2_box_3.webp" srcset="https://substackcdn.com/image/fetch/$s_!XtxX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15066a4f-3645-4080-bfe4-c268425e4031_1786x1138.webp 424w, 
https://substackcdn.com/image/fetch/$s_!XtxX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15066a4f-3645-4080-bfe4-c268425e4031_1786x1138.webp 848w, https://substackcdn.com/image/fetch/$s_!XtxX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15066a4f-3645-4080-bfe4-c268425e4031_1786x1138.webp 1272w, https://substackcdn.com/image/fetch/$s_!XtxX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15066a4f-3645-4080-bfe4-c268425e4031_1786x1138.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><pre><code><code>$ package-collection-sign generated-output.json signed-output.json private-key.pem certificate.cer</code></code></pre><h4><strong>Step 4: Distribute the collection on GitHub or the Swift Package Index</strong></h4><p>Once you have generated the package collection file, you can publish it. Public collections can be published to GitHub or the Swift Package Index, while internal collections can be shared directly with team members.</p><p>These steps might seem like a lot of up-front work, but fortunately, collections don&#8217;t need to be created frequently. </p><div><hr></div><h2><strong>Conclusion</strong></h2><p>Package Collections aren&#8217;t going to save hours of development time, but they will make commonly used packages more accessible without breaking your editor flow.</p>]]></content:encoded></item></channel></rss>