Learning by Shipping

products, development, management…

Realities of Performance Appraisal

Much has been written recently about performance ratings and management at some large and successful companies. Amazon has surfaced as a company implementing OLRs, organization and leadership reviews, which target the least effective 10% of an organization for appropriate action. Yahoo recently implemented QPRs, quarterly performance reviews, which rates people as “misses” or “occasionally misses” among other ratings. And just so we don’t think this is something unique to tech, every year about this time Wall St firms begin the annual bonus process which is filled with any number of legendary dysfunctions given the massive sums of money in play. Even the Air Force has a legendary process for feedback and appraisal.

This essay looks at the challenges of performance review in a large organization. The primary purpose is to help share the realities that designing and implementing a system for such an incredibly sensitive topic is a monumental challenge when viewed in isolation. If you overlay the environment of an organization (stock price, public perception, revenue or profit, local competition for talent, etc.) then any system at all can seem anywhere from tyrannical to fair to kick-ass for some period of time, and then swing the other way when the context changes. For as much as we think of performance management as numeric and thus perfectly quantifiable, it is as much a product of context and social science as the products we design and develop. We want quantitative certainty and simplicity, but context is crucial and fluid, and qualitative. We desire fair, which is a relative term, but insist on the truth, which is absolute.

While there is an endless quest for simplicity, much as with airline tickets, car prices, or tax codes it is naive to believe that simplicity can truly be achieved or maintained over time. The challenge doesn’t change the universally shared goal of simplicity (believe me HR people do not like complex systems any more than everyone else does) but as a practical matter such purity is unattainable. Therefore comparing any system to some ideal (“flat tax”, “fixed pricing”) only serves to widen the gap between desire and implementation and thus increases the frustration and even fear of a system.

If this topic were simple there would not be over 25,000 books listed on Amazon’s US book site for the query “performance review”. Worse, the top selling books are about how to write your review, game the system, impress your boss, or tell employees they are doing well when they really aren’t doing well. You get pretty far down the list before you get to books that actually try to define a system and even then those books are filled with caveats. My own view is that the best book on the topic is Measuring and Managing Performance in Organizations. It is not about the perfect review system but about the traps and pitfalls of just measuring stuff in general. I love this book because it is a reminder of everything you know about measuring, from “measure twice, cut once” to “measure what you can change” to “if you measure something it goes up” and so on.

Notes. I am not an HR professional and don’t get wrapped up in the nuances of terminology between “performance review” or “performance management” or “performance rating”. I recognize the differences and respect them but will tend to intermix the terms for the purposes of discussion. I also recognize that for the most part, people executing such a system generally don’t see the subtle distinctions in these words as much as they might mean something within the academy. I am also not a lawyer, so what I say here may or may not be legally permitted in your place of doing business (geography, company size, sector). Finally, this post is not about any specific company practice past or present and any similarity is unintended coincidence.

This post will say some things that are likely controversial or appear plain wrong to some. I’ll be following this on twitter to see what transpires, @stevesi.

5 Essential Realities

There are several essential realities to performance reviews:

  • Performance systems conflate performance and compensation with organizational budgets. No matter how you look at it, one person cannot be evaluated and paid in isolation of budgets. The company as a whole has a budget for how much to pay people (salary, bonus, stock, etc.) No matter what an individual’s compensation is part of a system that ultimately has a budget. The vast majority of mechanical or quantitative effort in the system is not about one person’s performance but about determining how to pay everyone within the budget. While it is desirable to distinguish between professional development and compensation, that will almost certainly get lost once a person sees their compensation or once a manager has to assign a rating. Any suggestion as to how to be more fair, allow for more flexibility, provide more absolute ratings, or otherwise separate performance from compensation must still come up with a way to stick to a budget. The presence of a budget drives the existence of a system. There is always a budget and don’t be fooled by “found money” as that’s just a budget trick.
  • In a group of any size there is a distribution of performance. At some point a group of people working towards similar goals will exhibit a distribution of performance. From our earliest days in school we see this with schoolwork. In the workplace there are infinite variables that influence the performance of any individual but the variability exists. In an ideal system one could isolate all the variables from some innate notion of “pure contribution” or “pure skill” in order to evaluate someone. But that can’t be done so the distribution one sees essentially lumps together many performance related variables.
  • In a system where you have m labels for performance, people who get all but the most rewarding one believe they are “so close” to the higher one. In school, teachers have letter grades or numeric ranges that break up test scores into “buckets”. In the workplace, performance systems generally implement some notion of grades or ratings and assign distributions to each of those. Much like a forced curve in a physics test, the system says that only a certain percentage of a population can get the highest performance rating and likewise a certain percentage of the team gets the lowest rating. The result is that most everyone in the organization believes they are extremely close to the next rating much like looking at a test and thinking if you could just get that one extra point you’d get the next letter grade. Because of human nature, any such system almost certain follows the corollary that managers are likely to imply or one being managed likely to hear evidence of just how close a call their review score was. There is a corollary: “everyone believes they are above average“.
  • Among any set of groups, almost all the groups think their group is delivering more and other groups are delivering less. In a company with many groups, managers generally believe their group as a whole is performing better by relevant measures and thus should not be held to the same distribution or should have a larger budget. Groups tend to believe their work is harder, more strategic, or just more valuable while underestimating those contributions from other groups. Once groups realize that there is a fixed budget, some strive to solve the overall challenges by allowing for higher budgets on some teams. In this way you could either use a different distribution of people (more at the top) or just elevate the compensation for people within a group. Any suggestion to do this would need to also provide guidance as to how groups as a whole are to be measured relative to each other (which sounds an awful lot like how individuals would be measured relative to each other).
  • Measurement is not an absolute but is relative. To measure performance it must be measured relative to something. Sales is the “easiest” since if you have a sales quota then your compensation is just a measure of how much you beat the quota. Such simplicity masks the knife fight that is quota settings and the process by which a comp sheet is built out, but it is still a relative measure. Most product have squishier goals such as “finish the product” or “market the product”. The larger the company the more these goals make sense but the less any individual’s day to day actions are directly related (“If I fix this bug will the sale really close?”). Thus in a large company, goal setting becomes increasingly futile as it starts to look like “get my work done” as the interconnection between other people and their work is impossibly hard to codify. Much of the writing about performance reviews focuses on goal setting and the skill in writing goals you can always brag about, unfortunately. All of this has taken a rather dramatic turn with the focus on agility where it is almost the antithesis of well-run to state months in advance what success looks like. As a result, measuring performance relative to peers “doing their work” is far more reliable, but has the downside that the big goals all fall to the top level managers. That’s why for the most part this entire topic is a big company thing—in a startup or a small company, actions translate into sales, marketing, products, and customers all very directly.

10 Real World Attributes

Once you take these realities you realize there will in fact be some sort of system. The goal of the system is to figure out how much to pay people. For all the words about career management, feedback, and so on that is not what anyone really focuses on at the moment they check their “score”. It certainly isn’t what is going on around the table of managers trying to figure out how to fit their team within the rules of the system.

Those that look to the once a year performance rating as the place for either learning how they are doing or for sharing feedback with an employee are simply doing it wrong—there simply shouldn’t be surprises during the process. If there are surprises then that’s a mistake no system can fix. There are no substitutes for concrete, actionable, and ongoing feedback about performance. If you’re not getting that then you need to ask.

At the same time, you can’t expect to have a daily/weekly rating for how you are doing. That’s because your performance is relative to something and that something isn’t determined on a daily basis. Finding that happy place is a challenge for individuals and managers, with the burden to avoid surprises falling to both equally. As much as one expects a manager to communicate well individuals must listen well.

Putting a system in place for allocating compensation is enormously challenging. There’s simply too much at stake for the company, the team, managers, and individuals. Ironically, because so much is at stake that materially impacts the lives of people, it is not unusual for the routine implementation of the process to take months of a given year and for it to occupy far more brain cycles than the actual externally facing work of the organization. Ironically, the more you try to make the process something HR worries about the even more disconnected it becomes from work and the more stress. As a result, performance management occupies a disproportionate amount of time and energy in large organizations.

Because of this, everyone in a company has enough experience to be critical of the system and has ideas how to improve it. Much like when a company does TV advertising and everyone can offer suggestions—simply because we all watch TV and buy stuff—when it comes to performance reviews since we all do work and get reviewed we all can offer insights and perspectives on the system. Designing a system from scratch is rather different than being critical of anecdotes of an existing system.

Given that so much is at stake and everyone has ideas how to improve the system, the actual implementation is enormously complex. While one can attempt to codify a set of rules, one cannot codify the way humans will implement the rules. One can keep iterating, adding more and more rules, more checks and balances, but eventually a process that already takes too much time becomes a crushing burden. Even after all that, statistically a lot of people are not going to be happy.

Therefore the best bet with any system is to define the variables and recognize that choices are being made and that people will be working within a system that by definition is not perfect. One can view this as gaming the system, if one believes the outcome is not tilted towards goodness. Alternatively, one can view this as doing the right thing, if one believes the outcome is tilted in the direction of goodness.

My own experience, is that there are so many complexities it is pointless to attempt to fully codify a system. Rather everyone just goes in with open eyes and a realistic view of the difficulty of the challenge and iterates through the variables throughout the entire dialog. Fixating on any one to the exclusion of others is when ultimately the system breaks down.

The following are ten of the most common attributes that must be considered and balanced when developing a performance review system:

  1. Determining team size. There is critical mass of “like” employees (job function, seniority, familiarity, responsibility) required to make any system even possible. If you have less than about 100 people no system will really work. At the same time, at about 100 people you are absolutely assured of having a sample size large enough to see a diversity in performance. There is going to be a constant tension between employees who believe the only fair way of evaluation is to have intimate knowledge of their work and a system that needs a lot of data points. In practice, somewhere between 1 and 5 people are likely to have intimate understanding of the work of an individual, but said another way any given manager is likely to have intimate knowledge of between 5 and about 50 people. At some point the system requires every level of management to honestly assess people based on a dialog of imperfect information. Team size also matters because small “rounding” efforts become enormous. Imagine something where you need to find 10% of the population and you have a team of 15 people to do that with. You obviously pick either 1 or 2 people (1 if the 10% is “bad”, 2 if it is “good”). Then imagine this rolls up to 15,000 people. Rather than 1500, you have either 1000 or 2000 people in that 10%. That’s either very depressing or very expensive relative to the budget. Best practice: Implement a system in groups of about 100 in seniority and role.
  2. Conflating seniority, job function, and projects does not create a peer group. Attempting to define relative contribution of a college new hire and a 10 year industry vet, or a designer and a QA engineer, manager or not, or a front-end v. ops tools are all impossibly difficult. The dysfunction is one where invariably as the process moves up the management chain there will be a bias that builds—the most visible people, the highest paid people or jobs, scarcest talent, the work that is understood and so on will become the things that get airtime in dialogs. There’s nothing inherently evil about this but it can get very tricky very quickly if those dialogs lead to higher ratings/compensation for these dimensions. This can get challenging if these groups are not sized as above and so you’ll find it a necessary balancing act. Best practice: Define peer groups based on seniority and job function within project teams as best you can.
  3. Measuring against goals. It is entirely possible to base a system of evaluation and compensation on pre-determined goals. Doing so will guarantee two things. First, however much time you think you save on the review process you will spend up front on an elaborate process of goal-setting. Second, in any effort of any complexity there is no way to have goals that are self-contained and so failure to meet goals becomes an exercise in documenting what went wrong. Once everyone realizes their compensation depends on others, the whole work process becomes clouded by constant discussion about accountability, expectation setting, and other efforts not directly related to actually working things out. And worse, management will always have the out of saying “you had the goal so you should have worked it out”. There’s nothing more challenging in the process of evaluation than actually setting goals and all of this is compounded enormously when the endeavor is a creative one where agility, pivots, and learning are part of the overall process. Best practice: let individuals and their manager arrive at goals that foster a sense of mastery of skills and success of the project, while focusing evaluation on the relative (and qualitative) contribution to the broader mission.
  4. Understanding cross-organization performance. Performance measurement is always relative, but determining performance across multiple organizations in a relative sense requires apples to oranges comparisons, even within similar job functions (i.e. engineering). If one team is winding down a release and another starting, or if one team is on an established business and another on a new business, or if one team has no competitors and another is in an intense battle, or if one team has a lot of sales support and another doesn’t, and so on are all situations which make it non-obvious how to “compare” multiple teams, yet this is what must happen at some level. Compounding this situation is that at some point in evaluation the basis for relative comparison might dramatically change—for example, at one level of management the accomplishment of multiple teams might be looked at through a lens that can be far removed from what members of those teams might be able to impact in their daily work. Best practice: do not pit organizations against each other by competing for rewards and foster cross-group collaboration via planning and execution of shared bets.
  5. Maintaining a system that both rates and rewards. Systems often have some sort of score or a grade and they also have compensation. Some think this is essential. Some think this is redundant. Some care deeply about one, but only when they are either very happy or very unhappy with the other. A system can be developed where these are perfectly correlated in which case one can claim they are redundant. A system where there is a loose correlation might as well have no correlation because both individuals and managers involved are hearing what they want to hear or saying one thing and doing another. At the same time, we’re all conditioned for a “score” and somehow a bonus of 9.67% doesn’t feel like a score because you don’t know what this means relative (so even though people want to be rated absolutely it doesn’t take long before they want to know where that stands relatively). Best practice: A clear rating that lets individuals know where they stand relative to their peer group along with compensation derived from that with the ability of a manager with the most intimate knowledge of the work to adjust compensation within some rating-defined range.
  6. Writing a performance appraisal is not making a case. Almost all of the books on Amazon about performance reviews focus on the art of writing reviews. Your performance review is not a trial and one can’t make or break the past year/month/quarter by an exercise in strong or creative writing. This holds for individuals hoping to make their performance shine and importantly for managers hoping to make up for their lack of feedback/action. The worst moments in a performance process are when an employee dissects a managers comments and attempts to refute them or when a manager pulls up a bunch of new examples of things that were not talked about when they were happening. Best practice: Lower the stakes for the document itself and make it clear that it is not the decision-making tool for the process.
  7. Ranking and calibrating are different. Much has been said about the notion of “stack rank” which often is used as a catch phrase for a process that assigns each member of a group a “one through n” score. This is always always a terrible process. There is simply no way to have the level of accuracy implied by such a system. What would one say to someone trying to explain the difference between being number 63 and number 64 on a 100 person team? The practice of calibrating is one of relative performance between members of peer groups as described above. The size and number of these groups is fixed and when done with adequate population size can with near certainty avoid endless discussion over boundary cases. Best practice: Define performance groups where members of a team fall but do not attempt to rank with more granularity or “proximity” to other groups.
  8. Encouraging excellent teams. Most managers believe their teams are excellent teams, and uniquely so. Strong performers have been hired to the team. Weak performers are naturally managed out if they somehow made it on to the team. Results show this. It becomes increasingly difficult to implement a performance review system because organizations become increasingly strong and effective. This is how it should be. At the same time this cannot possibly be a permanent state (even the sports teams get new players that don’t pan out over the course of a season). In a dynamic system there will be some years where a team is truly excellent and some years where it is not, but you can’t really know that in an absolute sense. In fact, the most ossifying element of performance appraisal is to assume that a given team or given person has reached a point where they are just excellent in an absolute sense and thus the system no longer applies. Whether the team is 100 engineers just crushing it or an executive team firing on all cylinders, it is very tempting to say the system doesn’t apply. But if the system doesn’t apply you don’t really have a system. Perhaps your organization will have a concept of “tenure” or you have a job function primarily compensated by quota based on quantitative measures—those are ways to have different systems. Best practice: Make a system that applies to everyone or have multiple systems and clear rules how membership in different systems is determined.
  9. Allowing for exceptions creates an exception-based process. When a team adds all of the potential constraints up and attempt to finally close in on performance of individuals, there is a tendency to “feel the pain” of all the rules and to create a model for exceptions. For example, you might have a 10% group but allow for up to 1% exceptions. Doing so will invariably create either the 9% or 11% group depending on if it is better to except up or down. If managers have the option of giving someone a low rating by extra money along with other people getting a high rating with less money, then invariably most people will get this mixed message. All of these exceptions quickly permeate an organization and individuals end up considering getting an exception a normal part of the process. Best practice: If there is going to be a system, then stick to it and don’t encourage exceptions.
  10. Embracing diversity in all dimensions. Far too often in performance appraisal and rewards, even within peer groups there is potential for the pull of sameness. This can manifest itself through any number of professional characteristics that can be viewed as either style or actual performance traits. One of the earliest stories I heard of this was about a manager that preferred people to set very aggressive goals for adding features. Unfortunately there was no measure for quality of the work. Other members of the team would focus on a combination of features and quality. Members of the latter group felt penalized relative to the person with the high bug count. At the same time, the team tended to be one that got a lot of features done early but had a much longer tail. Depending on when performance reviews got done, the story could be quite different. Perhaps both styles of work are acceptable, but not appreciating the “perceived slow and steady” is a failing of that manager to embrace styles. The same can be said for personal traits such as the always present quiet v. loud, or oral v. written, and so. Best practice: Any strong and sustainable team will be diverse in all dimensions.


Much more could be said about the way performance appraisal and reward can and should work in organizations. Far too much of what is said is negative and assumes a tone dominated by us v. them or worse a view that this is all a very straight forward process that management just consistently gets wrong. Like so many things in business there is no right answer or perfect approach. If there was, then there would be one performance system that everyone would use and it would work all the time. There is not.

Some suggest that the only way to solve this problem is to just have a compensation budget and let some level of management be responsible. That is a manager just determines compensation for each member of a team based on their own criteria. This too is a system—the absence of a system is itself a system. In fact this is not a single system but n systems, one for each manager. Every group will arrive at a way to distribute money and ratings that meets the needs of that team. There will be peanut butter teams, there will be teams that do the “big bonus”, and more. There will even be teams that use the system as a recruiting tool.

As much as any system is maligned, having a system that is visible, has some framework, and a level of cross-organization consistency provides many benefits to the organization as whole. These benefits accrue even with all the challenges that also exist.

To end this post here are three survival tips for everyone, individuals and managers, going through a performance process that seems unfair, opaque, or crazy:

  • No one has all the data. Individuals love to remind some level of management that they do not have all the data about a given employee. Managers love to remind people that they see more data points than any one individual. HR loves to remind people that they have competitive salary data for the industry. Executives remind people they have data for a lot of teams. The bottom line is that no one person has a complete picture of the process. This means everyone is operating with imperfect information. But it does not follow that everyone is operating imperfectly.
  • Share success, take responsibility. No matter what is happening and in what context, everyone benefits when successes are shared and responsibility is taken. Even with an imperfect system, if you do well be sure to consider how others contributed and thank them as publicly as you can. If you think you are getting a bad deal, don’t push accountability away or point fingers, but look to yourself to make things better.
  • Things work out in the end. Since no system is perfect it is tempting to think that one data point of imperfection is a permanent problem. Things will go wrong. We don’t talk about it much, but some people will get a rating and pay much higher than they probably deserve at some point. And yes, some people will have a tough time that they might not really deserve in hindsight. In a knowledge economy, talent wins out over time. No manager will hold one datapoint against a talented person who gracefully recovers from a misstep. It takes discipline and effort to work within a complex and imperfect system—this is actually one of the skills required for anyone over the course of a career. Whether it is project planning, performance management, strategic choices, management processes and more all of these are social science and all subject to context, error rates, and most importantly learning and iteration.

—Steven Sinofsky (@stevesi)

Written by Steven Sinofsky

November 9, 2013 at 11:00 am

Posted in posts

%d bloggers like this: