Managers without development experience - How do you effectively assess the performance of software engineers?

TL;DR

Assessing engineers without a development background isn’t about guessing who’s “smart” or counting tickets; it’s about combining results, activity signals, and peer insight into a coherent picture. When you treat outcomes as the anchor and use activity metrics and feedback as leading indicators - not scores - you can evaluate engineers fairly without pretending to be a senior developer.

  • Put outcomes first: what changed for customers, systems, and the business because of this engineer’s work.
  • Treat activity stats (tasks, story points, pull requests) as early warning signals and diagnostics, not leaderboards.
  • Use structured peer and team feedback to explain patterns numbers can’t.
  • Judge performance against the engineer’s actual context: role, domain, constraints, and team maturity.
  • Make expectations explicit and revisit performance continuously - not just once a year.

The Performance Question Every Non-Technical Manager Faces

How do you judge a job you can’t do yourself? How do you decide which engineer is quietly carrying a system on their back and which one is simply “busy” in Jira? And when your CFO asks why engineering is the largest line on the P&L, what can you show beyond a wall of tickets and a gut feeling that people are working hard?

This isn’t a theoretical problem. Software teams are expensive. Classic work like Peopleware has noted that programmers’ productivity can vary by a factor of ten between the fastest and slowest individuals in the same broad population. Yet those differences are rarely reflected cleanly in performance reviews, compensation, or promotion decisions.

At the same time, many organizations openly admit that their review processes don’t work. Deloitte’s recent work on performance management reports that roughly two-thirds of employees see traditional annual reviews as a waste of time that doesn’t help them perform better. Only a minority of executives believe their performance systems reliably identify high and low performers. The result is a familiar pattern: engineers feel misjudged, managers feel under-informed, and HR feels stuck running a process nobody trusts.

Engineers feel this most acutely because so much of their impact is invisible from outside the codebase. Refactoring, mentoring, documentation, and incident prevention all matter, but don’t show up cleanly in dashboards or ticket counts. If you aren’t technical, that invisibility makes it easy to over-weight whatever is visible - story points, tickets closed, or how confidently someone speaks in meetings.

So if you manage engineers without a development background, the question isn’t just, “How do I fill out the review form?” It’s:

How do I build a performance view I can defend - to myself, to my engineers, and to the business - without being able to read their code?

To answer that, you need a sharper definition of “performance” and a clearer understanding of what your different signals - results, activity metrics, and feedback - are actually telling you.


What “Performance” Actually Means: Results First, Signals Second

Before you look at a single metric or ask for peer feedback, it helps to settle one deceptively simple question:

When we say an engineer is “performing well,” what do we actually mean?

For software engineers, performance has two layers.

Results - the lagging indicators: What changed for users, systems, and the business because of their work? Did they help ship features customers actually adopt? Did support tickets and incidents in their area go down? Did operational toil or infrastructure cost decrease? These are the outcomes the business ultimately cares about.

Sustainability: Are those results achieved in a way that keeps the system and the team healthy? An engineer who ships quickly but leaves behind brittle code, rising incident rates, or burned-out teammates is effectively borrowing against the future. Peopleware argued decades ago that environment and team health are major determinants of productivity; noisy, interrupt-driven workplaces consistently produced worse results than quiet, focused ones, even with similar talent. That logic extends to individuals: work style and impact on others matter alongside raw output.

This “results + sustainability” framing is consistent with broader management thinking. Andrew Grove’s High Output Management defines a manager’s output as the output of their organization and the neighboring organizations under their influence, not their personal activity. The same principle applies one level down: an engineer’s performance is best captured by the durable value they help the team produce, not how busy they look.

However, results are lagging indicators. By the time you see a failed project, a spike in outages, or a drop in customer satisfaction, the damage is already baked into the quarter. To manage day to day, you need leading indicators that can hint at problems earlier. For non-technical managers, those leading indicators usually fall into two buckets:

  • Activity stats: tasks or story points completed, tickets moved across the board, pull requests created and reviewed, incidents participated in, changes deployed. These are easy to extract from tools and provide a rough picture of motion.
  • Behavioral signals: how the engineer works. Do they clarify requirements when stories are vague? Do they break down work into sensible chunks? Do they escalate risks early? Do others seek them out for help? These show up in planning meetings, design discussions, incident calls, and day-to-day collaboration.

Both kinds of signals are useful, but only if you treat them as clues, not as verdicts.

For example, a sustained drop in visible activity might mean an engineer is disengaged or blocked. It might also mean they’ve taken on harder, less divisible work, or are doing glue tasks (mentoring, onboarding, debugging tricky production issues) that aren’t logged well in your tools. On its own, “40% fewer tickets completed this month” doesn’t tell you which interpretation is true. It just tells you to go and ask.

The same caution applies to results. A feature that fails in the market may reflect poor product strategy or changing customer conditions, not necessarily poor engineering execution. Deloitte’s research on performance management highlights that one of the major failings of traditional reviews is misattributing systemic issues to individuals, which is a fast way to erode trust in both the process and leadership.

A practical way to frame all this is:

  • Results tell you what actually happened.
  • Activity stats tell you where to look when something seems off.
  • Behaviors tell you how this person tends to work under real constraints.
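To make the “clue, not verdict” idea concrete, here is a minimal sketch of treating an activity drop as a prompt for a conversation rather than a score. The numbers, threshold, and function name are all hypothetical; real activity data would come from your issue tracker, and the threshold would be tuned to your team.

```python
# Illustrative sketch only: thresholds and data are hypothetical, and this
# deliberately returns prompts for the manager, never performance scores.

def flag_activity_change(baseline_per_week: float, recent_per_week: float,
                         drop_threshold: float = 0.4) -> str:
    """Compare recent activity to a baseline and suggest a next step."""
    if baseline_per_week <= 0:
        return "No baseline yet - establish one before reading anything into activity."
    drop = (baseline_per_week - recent_per_week) / baseline_per_week
    if drop >= drop_threshold:
        # A large sustained drop is a reason to ask, not to conclude: it may
        # mean disengagement, a blocker, harder indivisible work, or untracked
        # glue work (mentoring, onboarding, incident debugging).
        return "Check in: activity is well below baseline - ask what changed."
    return "No signal: activity is within its normal range."

print(flag_activity_change(baseline_per_week=10, recent_per_week=6))
# A 40% drop crosses the threshold, so this prints the check-in prompt.
```

The point of the design is that the function’s only outputs are “go ask” or “nothing to see” - there is no branch that produces a judgment about the person.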

In the next section, we’ll look at why this mix is especially hard for non-technical managers to interpret - and how to design around that constraint instead of trying to wish it away.


Why It’s Structurally Hard for Non-Technical Managers

If you're a manager without engineering experience, there are hard limits on what you can directly evaluate. At a basic level, you can’t:

  • Read or critique code directly
  • Judge the complexity of technical tradeoffs
  • Discern whether a “blocked” developer is spinning or solving real ambiguity

That’s not a personal failing; it’s a structural constraint. You’re forced to lean on what you can see from the outside: how people talk in standup, how often their name appears in Slack, whether work appears to ship on time. What you don’t have is the intuition that experienced engineers build over years - an internal sense of which tasks were trivial, which were technically nasty, and which tradeoffs will quietly accumulate into tech debt.

The difficulty grows once performance reviews are expected to be objective, calibrated, and defensible. Under that pressure, non-technical managers often lean too hard on proxies: peer feedback, sprint “velocity,” or estimates. Each of these is useful when treated as one signal among many; each becomes dangerous when treated as the main truth. A confident engineer who speaks smoothly in meetings can look more “senior” than a quieter peer who spends most of their time wrestling with hard backend problems. A team that inflates story points or pads scope can appear more productive than a team that estimates honestly and takes on more complex work.

Research on developer productivity shows that many organizations still rely on task-based measures - story points, hours, or ticket counts - as primary indicators of effectiveness, even though these were never designed to measure impact or quality (McKinsey). The risk isn’t that these metrics exist; it’s that they are read as performance scores rather than as partial, noisy signals.

What you need in this situation isn’t to become a part-time engineer. You need a system for observation, pattern recognition, and structured input: a way to interpret results, activity stats, and peer feedback together over time. Once you accept that you’ll never have perfect visibility, you can stop pretending you do - and start designing a review process that is honest about what you can see and deliberate about how you fill in the gaps. That’s the foundation for fixing how performance reviews go wrong.


How Reviews Go Wrong: Misusing Metrics, Ignoring Feedback, Missing Work

Once you start looking at results, activity, and peer insight, you also start to see how easily each of those can be misused.

A common failure mode is turning activity metrics into performance scores. Story points, ticket counts, and pull requests are attractive because they’re easy to query and chart. But they were designed to help teams plan and coordinate work - not to rank people. When a manager starts treating “points per sprint” as an individual productivity number, engineers learn to game the number: inflate estimates, slice work unnaturally small, or avoid messy, cross-cutting tasks that defy neat tracking. Research on developer productivity warns that task-based metrics are often misaligned with impact and can distort behavior if used as evaluation criteria (McKinsey).

A second failure mode is mishandling results. It’s tempting to attribute a successful project - or a failed one - primarily to the most visible engineer. In reality, outcomes are entangled with scope changes, upstream product decisions, tech debt, staffing, and interruptions. Peopleware shows how environmental factors and organizational choices can dominate individual performance, sometimes by an order of magnitude. When reviews ignore those factors, individuals are punished or rewarded for structural issues they didn’t control.

A third failure mode comes from unstructured peer feedback. Done well, peer input surfaces things you can’t see from outside the team: mentoring, incident leadership, documentation, quiet reliability. Done badly, it devolves into gossip, popularity, or bias. People who speak confidently, share a background with leadership, or spend more time in visible meetings often receive more glowing informal reputations than quieter peers doing equally or more impactful work. Without structure, “feedback” amplifies existing power dynamics rather than correcting them.

Finally, many review systems simply fail to see invisible work at all. Nobody opens a Jira ticket for “spent an hour calming down a panicked teammate” or “rewrote a confusing runbook so on-call is less painful.” Yet that glue work is what holds teams together. Books like The Manager’s Path and essays like “Being Glue” have argued that leaders need to actively surface, name, and reward this work, or it will fall disproportionately on a few people and never count toward advancement.

When you look across these failure modes, a pattern emerges: it’s not that metrics or feedback are inherently bad. It’s that they’re being used as shortcuts for judgment instead of as inputs into it. The next step is to understand what engineers actually do, so you can interpret those signals in context.


A Quick Orientation: What Engineers Actually Do

You don’t need to memorize the entire software development lifecycle, but you do need a basic map of where engineers spend their time. Most teams move repeatedly through five broad phases:

  • Discovery and planning – understanding user needs, constraints, and business goals; shaping what should be built.
  • Design and scoping – deciding how to build it, estimating complexity, and breaking work into chunks.
  • Implementation and review – writing code, reviewing others’ work, and integrating changes.
  • Testing and rollout – validating behavior, managing risk, and deploying changes.
  • Operations and maintenance – monitoring, responding to incidents, paying down technical debt, and iterating.

Individual engineers don’t weigh each phase equally. A backend engineer working on payments might spend a lot of time in design, risk analysis, and testing. A frontend engineer in a growth team might iterate quickly through small experiments and A/B tests. Platform and infrastructure engineers may touch almost no user-facing features at all but fundamentally shape how fast everyone else can move.

Roles add another lens. Roughly:

  • Junior engineers focus on learning, building simple features, and following established patterns.
  • Mid-level engineers own features end-to-end, handling ambiguity with some guidance.
  • Senior engineers shape technical direction, de-risk projects, and mentor others.
  • Staff/principal engineers influence multiple teams and critical systems, often doing design and coordination work that looks “invisible” on boards.
  • Tech leads balance architecture, execution, and people coordination for a team.

If you judge everyone by the same surface metrics - tickets closed, story points, or number of pull requests - you’ll miss that a senior engineer may deliver impact mostly through design, mentoring, and reducing risk. That’s why any fair approach has to reason at the level of how work flows through the system, not at the level of “who closed the most tasks this sprint.”

A simple way to get there is to treat engineering as a black box whose inputs, processes, and outputs you can still observe.


A Black-Box Mental Model: Inputs → Process → Outputs

When you can’t directly inspect code, you can still reason about how engineers work by looking at three stages: inputs, process, and outputs.

Inputs are the problems and constraints the engineer is given. Are they working with clear requirements or with fuzzy product ideas? Is the system they touch well-factored or a tangle of legacy code? Are there dependencies on other teams that regularly cause delays? These conditions vary hugely across teams and projects.

Process is how the engineer responds to those inputs. Do they clarify requirements early or code into the void? Do they break work into coherent steps or take on risky, all-or-nothing changes? Do they raise risks in planning meetings, involve the right people in design, and give clear status updates? This is where you see communication, collaboration, and judgment.

Outputs are what happens over time: features shipped, incidents created or prevented, system reliability, bugs found in testing instead of production, and how teammates describe working with them.

This black-box model maps neatly to your three pillars:

  • Results live mainly in outputs: what changed in the product, system, or team.
  • Activity stats provide a window into process: where effort is going, how work moves, whether it’s stuck.
  • Peer feedback spans process and outputs: how decisions felt in practice, who took ownership, who made others better.

As a non-technical manager, you don’t need to see every implementation detail. You do need to ask: given the inputs, was the process thoughtful and collaborative, and did the outputs move things in the right direction? With that framing, we can define the three pillars of a fair review more concretely.


The Three Pillars of a Fair Review: Results, Activity, and Peer Insight

At this point, we can make your evaluation model explicit:

  • Results (lagging indicators)
  • Activity stats (leading indicators)
  • Peer and team feedback (contextual insight)

Results (lagging indicators): These are the business and system outcomes that persist after the work is done: improvements in reliability, reductions in incidents, meaningful features launched, internal tools that reduce manual effort, cost savings, or risk reduction. They answer the question, “What is different because this engineer was here?”

Activity stats (leading indicators): These are the traces in your tools: tasks or story points completed, tickets touched, pull requests opened and reviewed, incidents handled, documentation updated. They answer, “What seems to be happening week to week?” but not “Was it valuable?”

Peer and team feedback: This is how colleagues experience working with the engineer: their collaboration, technical judgment, reliability, mentoring, and impact on team morale. When structured well, it answers, “How do they work with others, and how do others rely on them?”

A fair review doesn’t let any one pillar dominate by default. Instead, it looks for alignment or tension between them:

  • Strong results, steady activity, and positive peer feedback → likely strong performance.
  • Weak results, low activity, and repeated peer concerns → likely underperformance.
  • Mixed patterns (e.g., strong peer praise but weak visible results) → a case to investigate more carefully.
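The three patterns above can be written down as a simple triage rule. This is an illustrative sketch, not a scoring system: the “strong/weak” labels are deliberate simplifications of what is really a bundle of observations per pillar.

```python
# Illustrative sketch: pillar-level readings are simplified to strings, and
# anything short of full alignment routes to investigation, not judgment.

def triage(results: str, activity: str, peer: str) -> str:
    """Combine pillar readings ('strong' or 'weak') into a next step."""
    signals = (results, activity, peer)
    if all(s == "strong" for s in signals):
        return "likely strong performance"
    if all(s == "weak" for s in signals):
        return "likely underperformance - confirm context first"
    # Mixed patterns (e.g. strong peer praise but weak visible results)
    # are the interesting cases, and they never get an automatic verdict.
    return "mixed signals - investigate before concluding"

print(triage("weak", "strong", "strong"))
# prints "mixed signals - investigate before concluding"
```

Notice that even the fully aligned cases are hedged (“likely”): the rule tells you where to spend your attention, not what to write in the review.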

Camille Fournier, in The Manager’s Path, explicitly advocates for 360° input - combining self-review, manager review, and peer feedback - to get a more complete and fair view of an engineer’s performance. That approach fits naturally with this three-pillar model: no single perspective is allowed to dominate.

Once you see performance through these three lenses, you can build a concrete rubric of dimensions that map to observable behaviors rather than vague labels.


Performance Dimensions You Can Actually Observe

To move from philosophy to practice, it helps to break performance into a small set of observable dimensions. These work across most engineering roles:

  • Delivery reliability
  • Ownership and scope
  • Technical quality and risk management
  • Collaboration and influence
  • Learning and adaptability
  • Glue and team-enabling work

Delivery reliability – Do they usually deliver what they commit to, given the constraints? When things slip, do they signal early and help replan, or do surprises show up at the end?

Ownership and scope – Do they take responsibility for problems end-to-end, or just execute narrowly defined tickets? Are they the kind of person others trust to own ambiguous work?

Technical quality and risk management – In manager-friendly terms: Do they leave systems more stable and maintainable than they found them? Are their changes associated with fewer incidents, cleaner interfaces, and less rework over time?

Collaboration and influence – Do they make people around them more effective? That includes clear communication, helpful code or design reviews, productive disagreement, and cross-team coordination.

Learning and adaptability – How quickly do they ramp up on new domains or tools? Do they respond constructively to feedback and adjust, or repeat the same mistakes?

Glue and team-enabling work – Do they contribute to documentation, onboarding, process improvements, tooling, and mentoring? Are they one of the people who keep the team functioning smoothly?

Each dimension can be informed by your three pillars. For example:

  • An engineer strong in delivery reliability will have a track record of hitting realistic goals (results), visible activity tied to meaningful work rather than cosmetic tasks, and peer feedback that mentions dependability.
  • Someone strong in glue work might not top the ticket count, but peers will repeatedly name them as a mentor or go-to helper, and their efforts show up indirectly in smoother releases and fewer repeated mistakes.
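One way to keep a rubric behavior-based is to store it as data that pairs each dimension with the pillar evidence you would look for, then generate review questions from it. A hedged sketch follows; the dimension names and evidence descriptions are illustrative examples, not a standard.

```python
# Illustrative rubric-as-data: each dimension maps to where its evidence
# shows up across the three pillars. Entries here are examples only.
RUBRIC = {
    "delivery_reliability": {
        "results": "commitments met against realistic goals",
        "activity": "steady flow of meaningful work, early re-planning when blocked",
        "peer_feedback": "named as dependable; no end-of-sprint surprises",
    },
    "glue_work": {
        "results": "smoother releases, fewer repeated mistakes",
        "activity": "docs, onboarding, and tooling changes (often untracked)",
        "peer_feedback": "repeatedly named as a mentor or go-to helper",
    },
}

def review_prompts(dimension: str) -> list[str]:
    """Turn one rubric dimension into concrete questions for a review."""
    evidence = RUBRIC[dimension]
    return [f"What {pillar} evidence do we have for: {desc}?"
            for pillar, desc in evidence.items()]

for prompt in review_prompts("glue_work"):
    print(prompt)
```

Keeping the rubric as shared data, rather than as prose in one manager’s head, is what makes it discussable and auditable across managers.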

High Output Management emphasizes that a manager’s output is largely the result of how they structure work and develop their people. A clear, behavior-based rubric helps you do that on purpose, not by instinct.

Once you have these dimensions, the question becomes how to gather the inputs that feed them - starting with the pillar that sees the most otherwise-invisible work: peer and team feedback.


Getting Peer and Team Feedback Right

Peer feedback is the only pillar that can see certain types of work clearly: mentoring, pairing, design influence, on-call behavior, everyday reliability. But it only helps if you collect and interpret it carefully.

Unstructured prompts like “What do you think of Alex?” invite bias and personality judgments. Instead, you want questions that force people to describe specific behaviors and situations. For example:

  • “Tell me about a time they unblocked you or helped you understand a system.”
  • “How do they usually behave in design discussions or reviews?”
  • “If you were on call and something broke in their area, how confident would you feel with them on the incident channel?”

This kind of feedback maps directly to your dimensions: collaboration, ownership, technical judgment, and glue work. It also makes it easier to reconcile conflicting views. If one person says “they’re great” and another says “they’re difficult to work with,” concrete examples help you see whether those comments reflect different contexts, conflicting expectations, or genuine patterns.

The Manager’s Path recommends using regular 1:1s and ongoing feedback, not just annual review cycles, to avoid “surprise” narratives at the end of the year. That advice aligns well with your three-pillar model: you want a steady stream of small observations rather than one big, fuzzy snapshot.

When you synthesize peer input, look for:

  • Consistency over time (“reliable mentor,” “steps up in incidents”)
  • Alignment or tension with results and activity patterns
  • Evidence of invisible work you should recognize explicitly

Your role is not to take peer opinions at face value, but to treat them as another facet of the same object. Activity metrics show what moved; results show what changed; peer feedback shows how people experienced that change from inside the team.


Interpreting Results and Lagging Metrics Without Over-Simplifying

Because results matter most, it’s easy to overcorrect and decide that only outcomes count. That approach has problems too.

First, outcomes are often team-level, not individual. A major product launch involves product managers, designers, engineers, QA, marketing, and operations. Attributing all of its success - or failure - to one engineer ignores that reality. Second, outcomes are shaped by constraints: legacy systems, headcount, external dependencies, leadership decisions. An engineer working in a severely under-resourced, fragile system may accomplish more by preventing new incidents than by shipping new features.

A better way to use results is to ask: given their role and context, did this engineer move things in the right direction in a way that others could see and build on?

For example:

  • A senior backend engineer who reduced a key service’s error rate and simplified its data model, making future changes easier.
  • A mid-level frontend engineer who owned a new flow from design to rollout, collaborated well with design and product, and handled feedback without drama.
  • A platform engineer who improved CI reliability, cutting flakiness and reducing wasted engineering time.

McKinsey’s findings about improvements in defect rates, developer experience, and customer satisfaction when organizations invest deliberately in developer productivity underline that focusing on meaningful outcomes - not just activity - pays off at scale. At the individual level, you’re looking for contributions that clearly support those types of improvements.

When results are weak, your next move is not to assign blame, but to investigate:

  • Were goals clear?
  • Were dependencies realistic?
  • Did this engineer have the skills and support needed?

Results should drive inquiry, not shortcuts. They tell you where to look more closely at activity patterns and peer experiences so you can distinguish between systemic issues and individual performance problems.


Context Matters: Team Type, Domain, and Constraints

No evaluation system is fair if it ignores context. Two engineers can produce very different visible outputs while both performing strongly:

  • One works on a marketing landing page with a modern stack, few dependencies, and a short feedback loop.
  • Another works on a payments engine with regulatory constraints, legacy code, and high risk attached to every change.

Comparing their ticket counts or release frequency directly is meaningless.

Resilient Management and related leadership writing emphasize that context - team maturity, domain risk, and organizational goals - must shape how you interpret behavior and outcomes. A “slow” engineer in a high-risk environment may actually be making exactly the right tradeoffs to avoid incidents. A “fast” engineer in a low-risk environment may still be underperforming if they aren’t collaborating well or tackling important work.

As a manager, you should explicitly factor in:

  • Team type: product feature team, platform/infra, security, data, etc.
  • Domain risk: financial, safety-critical, regulated, or low-risk experimentation.
  • System maturity: greenfield vs. deeply entrenched legacy.

The question is always: given this setup, what does strong performance look like? Your three pillars stay the same, but your expectations on each dimension shift. This is where shared rubrics and ladders become important, so that context can be discussed openly rather than applied implicitly.


From Personal Judgment to Shared Rubrics and Career Ladders

If performance only lives in your head, it will eventually leak bias and create confusion. Engineers will try to guess what you value; different managers will apply different standards; promotions will feel arbitrary.

A more robust approach is to work with engineering leadership to translate your dimensions and three-pillar model into a shared rubric or career ladder. Books like The Manager’s Path advocate for clearly describing expectations at each level in terms of behaviors and impact over time, not just years of experience or tools known.

Rather than inventing a new rubric from scratch, many teams adapt public engineering ladders from companies they respect and then tune them to their own context. Whatever the source, the ladder should describe each level in terms of the same observable dimensions: seniors and staff show broader influence across them, while juniors are still building depth in a subset.

The key is for this language to be:

  • Written down and accessible
  • Discussed regularly in 1:1s, promotion cases, and hiring interviews
  • Updated deliberately, not ad hoc

For non-technical managers, having a shared rubric is a guardrail. It reduces the chance that personal preference, proximity, or anecdote dominate reviews. For engineers, it reduces guesswork: they can see what “good” and “next level” look like in terms that relate to their day-to-day work.


Making Performance a Continuous Conversation

The best-designed rubric still fails if it only appears once a year in a formal document. Engineering work is too fluid for that. Priorities change, incidents happen, roadmaps shift, and people grow - or get stuck.

High-trust teams fold performance into the regular rhythm of work. That doesn’t mean every 1:1 becomes a micro-review, but it does mean you touch on the three pillars as part of ongoing coaching:

  • Talk about results: what impact did the last period’s work have on users, systems, or the team?
  • Look at activity patterns together: are they feeling blocked, overloaded, or underutilized? Are they doing untracked work you should start recognizing explicitly?
  • Bring in peer feedback early: share small observations from others while they’re fresh, not just in a formal cycle.

Research from Culture Amp on feedback culture points the same way: regular, informal feedback loops are associated with meaningfully higher engagement and retention than annual-only review cycles.

When performance becomes part of ongoing coaching, it shifts from judgment to growth.


The Culture You Create When You Judge Fairly

Evaluation is not just an HR process; it is one of the strongest culture-shaping tools you have.

If you reward only visible heroics and raw output, engineers learn to optimize for visibility. They chase quick wins, avoid hard but necessary cleanup work, and burn themselves out fixing fires that could have been prevented. Over time, systems become brittle, turnover rises, and the people doing quiet, stabilizing work leave or disengage. Peopleware quantifies the cost of that churn, both in lost productivity and in the difficulty of rebuilding cohesive teams.

If you reward impact, ownership, collaboration, learning, and glue work, you get something very different: teams that steadily reduce risk, improve user and developer experience, and share knowledge. McKinsey’s findings around organizations that invest in developer experience and meaningful metrics - seeing fewer defects, better employee experience, and happier customers - are one macro-level reflection of what individual fair evaluation supports.

As a non-technical manager, you may never be able to personally review every design or reason about every algorithm. But you can decide:

  • Which signals you pay attention to.
  • How you combine results, activity, and peer insight.
  • How transparent you are about expectations and uncertainty.
  • How quickly you correct your own misjudgments.

You don’t need to write code to evaluate great engineering - but you do need structure, context, and curiosity. When you shift from tracking activity to understanding impact, and from chasing a single metric to reading patterns across your three pillars, you build a performance system that engineers can trust. That trust is what keeps people engaged, teams aligned, and the organization able to justify the investment it makes in its engineers.