AI Can Decrease Your Productivity If Not Used Correctly

Lessons from using GitHub Copilot for unit testing in a legacy codebase

May 8, 2026·

AIGitHub CopilotTestingProductivity

AI Can Decrease Your Productivity If Not Used Correctly

The Goal

Set up a unit testing framework and write unit tests
Gradually increase code coverage over time
Reduce bugs and defects, improving overall product quality

The Story

Unit testing in legacy codebases always surfaces interesting stories. What follows is one from that very genre.

The team was working on a legacy codebase with limited test coverage. The code was written in a way that made it inherently difficult to test — tightly coupled dependencies, no interfaces, business logic embedded deep in classes with no clear boundaries. Rather than tackle this manually, the team decided to bring in GitHub Copilot to accelerate test generation.

It seemed like the right call. Copilot had been helpful elsewhere. Why not here? They soon found out.

The Problems

The team started using Copilot to generate PHPUnit tests. But what came back was often unreliable, inconsistent, and sometimes completely wrong.

1. Irrelevant tests

Copilot generated tests that had little to do with the actual code being tested. The method names matched, but the logic being asserted had nothing to do with what the method actually did.

2. Framework mismatch

The team had a clear testing framework standard. Copilot occasionally ignored it, generating tests using patterns or conventions from other frameworks — creating inconsistency that was easy to miss during a quick scan.

3. Hallucinations

Copilot invented method calls, class names, and return values that simply did not exist in the codebase. These tests would fail immediately at runtime, but only if someone actually ran them — which, as it turned out, wasn't always happening.

4. Markdown in code files

In some cases, Copilot generated test files that contained raw Markdown syntax — explanation text, headers, code fences — embedded directly in PHP files. These were unusable without manual cleanup.

The Root Cause

The real problem wasn't Copilot. It was the team's relationship with Copilot's output.

Team members placed too much trust in what was generated. The assumption was: Copilot is AI, AI is smart, therefore the output is probably fine. Tests were being committed without thorough review, coverage metrics weren't improving, and time was being lost to debugging AI-generated code instead of writing real features.

This was the opposite of what the team had set out to achieve.

The Fix: Copilot Agents

The team regrouped and identified two root problems:

Copilot wasn't being given enough context to generate accurate tests
There was no structured review step before generated tests were committed

This is where Copilot Agents made the difference. A Copilot agent is a set of structured instructions that guides Copilot through a specific task with clear constraints, a defined workflow, and an explicit output format.

Here's the agent the team built — a UnitTests Generator:

markdown

---
description: "Use when: generating PHPUnit tests, creating new test cases, writing
  tests with mocks and stubs, testing isolated units, reviewing existing tests"
name: "UnitTests Generator"
tools: [read, edit, search]
user-invocable: true
---

You are a specialist at generating and reviewing PHP unit tests using PHPUnit
with proper mocking.

## Constraints

- DO NOT use terminal execution or run commands
- DO NOT mix testing frameworks; always use PHPUnit for this codebase
- DO NOT generate tests without understanding the source code first
- DO NOT test external dependencies; mock all collaborating objects

## Approach

1. Analyze Source Code: Read PHP classes/functions to understand their behavior
2. Identify Dependencies: Determine what collaborating objects need to be mocked
3. Design Mock Strategy: Decide which dependencies to mock and expected behaviors
4. Generate Tests: Write PHPUnit test methods using createMock() or getMockBuilder()
5. Review Quality: Check for proper naming, adequate mocking, and complete assertions
6. Document Intent: Include comments explaining what each test verifies

## Output Format

When generating tests:
- Include a PHPUnit test class with proper namespace and use statements
- Use createMock(), expects(), willReturn() for mocking
- Each test method follows the testDoesXWhenYHappens naming pattern
- Include a setUp() method to initialize common mocks

When reviewing tests:
- Identify what"s working well
- List gaps in mock coverage or test isolation issues
- Suggest specific improvements with code examples
- Provide a quality rating: needs improvement / good / excellent

Why the Agent Works

It constrains before it generates

The Constraints section explicitly tells Copilot what not to do. Don't mix frameworks. Don't generate tests before reading the source. Don't touch external dependencies. This directly addressed the hallucination and framework-mismatch problems from before.

It enforces a workflow, not just an output

The Approach section sequences Copilot's thinking: analyze → identify dependencies → design mock strategy → generate. This mirrors how a senior developer would approach writing a test, rather than letting Copilot pattern-match against whatever training data felt relevant.

It separates generation from review

The agent handles both writing and reviewing tests, with different output formats for each. The built-in review mode gave the team a way to audit AI output systematically — the exact step that was missing before.

It's opinionated about conventions

Naming conventions, class structure, setup methods — none of these are left to interpretation. Copilot follows the project's standards because the agent encodes them explicitly.

Lessons Learned

Always read the source before generating. An AI that doesn't understand what it's testing will invent things that sound plausible.
Human review is non-negotiable. AI output is a draft, not a deliverable. Treat it accordingly.
Agents are structured investment, not magic. Writing a good agent takes time upfront, but it pays back with every test generated from that point forward.
Guardrails make AI tools faster, not slower. The more precisely you define what you want, the less time you spend fixing what you didn't.

Closing Thought

The agent didn't make Copilot smarter. It made Copilot focused.

The difference between an AI tool that wastes your time and one that accelerates it often comes down to how well you've defined the guardrails — not how capable the model is. The model was capable the whole time. The team just hadn't given it what it needed to succeed.

That's a lesson worth carrying beyond unit testing.

← Back to all posts