Assess Impact, Not Activity

Intro

Managers without development experience often struggle to evaluate engineers, falling back on weak proxies like activity counts or confidence "vibes." Evaluating engineers well means replacing guesswork with measurement: grade performance as impact, not activity. This guide covers how to build a career ladder, collect forensic evidence from artifacts, and run calibration sessions that reduce bias.

1. Define Performance as Impact

If you are not a developer, the easiest mistake is to grade the visible surface area of work: the number of pull requests (PRs), Slack responsiveness, or confidence in meetings.

These are weak proxies that incentivize gaming the system (e.g., splitting work to "look busy").

The Durable Framework:
Performance = Impact relative to expectations.

"Good" looks different for a new grad than for a Staff Engineer.

Replace praise-y adjectives with measurable outcomes.

Bad: "Significantly improved performance."
Good: "Reduced time-to-first-byte by 180ms for the 5 most visited endpoints."

Your job is to produce a defensible story that survives scrutiny. Remember: Performance is impact, not activity.

2. Build a Shared Ladder

Expectations are fuzzy without a rubric. A career ladder fixes this by turning "Senior" from a title into a set of observable behaviors.

The Ladder Protects You:
It prevents the common failure mode of comparing two engineers with different shapes of work and concluding one "does more" based on vibes.

The Rule: Compare the engineer against the ladder, not against other people.

This reduces politics and protects specialists whose impact looks different from feature-factory output.

3. Collect Forensic Evidence

If you can't judge code quality by reading diffs, you can still judge performance by assembling a portfolio of evidence.

The Evidence Pack:

  • Self-Review: Ask engineers to maintain a "brag document" (a running list of work and impact). This is a memory system, not fluff.
  • Peer Review: Frame questions to extract signal, not praise. Ask: "What did this person do that changed outcomes for you?"
  • Artifacts: Request links to design docs, incident writeups, and launch plans.

Do not rely on a single channel, least of all your own impressions. Triangulate the truth from artifacts and peers.

4. Run Delta-Based Conversations

A performance conversation is not a verdict; it is a debugging session plus a contract.

The Script:

  • Expectation: "Here is what your level expects in scope/execution." (Point to the ladder).
  • Evidence: "Here is the evidence I saw this cycle (2–4 artifacts)."
  • Delta: "Here is where you exceeded/missed expectations."
  • Plan: "Here are 1–2 behaviors to amplify next cycle."

Anchor every feedback point to a specific artifact. "You need to be more senior" is vague; "You need to drive the architecture decision for Project X" is actionable.

5. Avoid the "One Metric" Trap

Non-technical leaders often ask, "What metric will you use to evaluate engineers?"

Software is written by teams, not individuals. When you score individuals on team-level throughput metrics, you are measuring queueing delays and legacy-code risk, not individual performance.

The Gaming Problem:
Any single metric (lines of code, tickets closed) can be optimized without increasing value.

Use metrics as prompts for questions ("Why did review time spike?"), but evaluate individuals based on the ladder and evidence.

6. Calibrate for Consistency

If you worry that you "don't know enough," calibration is how you make that worry productive.

Calibration Rules:

  • Read, don't present: If decisions depend on who pitches best, you reward rhetoric over engineering. Read the written review instead.
  • Compare to the ladder: This protects you from false equivalencies.
  • Use the escape hatch: If you don't understand a technical trade-off, ask a technical peer to sanity-check that specific risk. This is operational honesty, not weakness.

Closing Thoughts

You do not need to be a developer to evaluate developers. You need a system that forces evidence and reduces bias.

Your advantage as a non-technical manager is that you are forced to evaluate outcomes, collaboration, and decision quality - not code style.

Performance is impact, not activity.

Do This Next: The Performance Review Checklist

Audit your next review cycle against these four items.

  • The Ladder Check: Do I have a written rubric defining expectations for this role level?
  • The Evidence Audit: Do I have 3 specific artifacts (docs, PRs, incidents) to support my rating?
  • The Adjective Ban: Have I removed words like "significantly" and replaced them with numbers?
  • The Calibration Step: Have I scheduled a session to read this review with a technical peer?