Posts Tagged ‘engineering’
I have spent a lot of time trying to manage work so it is successful outside of a single location. I’ve had mixed results and have found only three patterns which are described below. Before that, two quick points.
First, this topic has come up this time related to the Paul Graham post on the other 95% of developers and then Matt Mullenweg’s thoughtful critique of that (also discussed on Hacker News). I think the idea of remote work is related to but not central to immigration reform and a position one might have on that. In fact, 15 years ago when immigration reform was all but hopeless many companies (including where I worked) spent countless dollars and hours trying to “offshore” work to India and China with decidedly poor results. I even went and lived in China for a while to see how to make this work. Below the patterns/lessons subsume this past experience.
Second, I would just say this is business and business is a social science, so that means there are not rules or laws of nature. Anything that works in one situation might fail to work in another. Something that failed to work for you might be the perfect solution elsewhere. That said, it is always worth sharing experiences in the hopes of pattern matching.
The first pattern is good to know, just not scalable or readily reproducible. That is when you have a co-located and functioning team and members need to move away for some reason then remote work can continue pretty much as it has before. This assumes that the nature of the work, the code, the project all continue on a pretty similar path. Any major disruption—such as more scale, change in tools, change in product architecture, change in what is sold, etc.—and things quickly gravitate to the less functional “norm”. The reality is in this case that these success stories are often individuals and small teams that come to the project with a fixed notion of how to work.
The second pattern that works is when a project is based on externally defined architectural boundaries. In this case little knowledge is required that span the seam between components. What I mean by externally defined is that the API between the major pieces, separated by geography, is immutable and not defined by the team. It is critical that the API not be under the control of the team because if it is then this case is really the next pattern. An example of this might be a team that is responsible for implementing industry standard components that plug in via industry standard APIs. It might be the team that delivers a large code base from an open source project that is included in the company’s product. This works fine. The general challenge is that this remote work is often not particularly rewarding over time. Historically, for me, this is what ended up being delivered via remote “outsourced” efforts.
The third pattern that works is that those working remotely have projects that have essentially no short term or long term connection to each other. This is pretty counter-intuitive. It is also why startups are often the first places to see remote work as challenging, simply because most startups only work on things that are connected. So it is no surprise that for the most part startups tend to want to work together in one location.
In larger companies it is not uncommon for totally unrelated projects to be in different locations. They might as well be at separate companies.
The challenge there is that there are often corporate strategies that become critical to a broad set of products. So very quickly things turn into a need for collaboration. Since most large, existing products, tend to naturally resist corporate mandates the need for high bandwidth collaboration increases. In fact, unlike a voluntary pull from a repository, a corporate strategy is almost always much harder and much more of a negotiation through a design process than it is a code resuse. That further requires very high bandwidth.
It is also not uncommon for what was once a single product to get rolled into an existing product. So while something might be separate for a while, it later becomes part of some larger whole. This is very common in big companies because what is a “product” often gets defined not by code base or architecture but by what is being sold. A great example for me is how PowerPoint was once a totally separate product until one day it was really only part of a suite of products, Office. From that decision forward we had a “remote” team for a major leg of our product (and one born out of an acquisition at that).
That leaves trying to figure out how a single product can be split across multiple geographies. The funny thing is that you can see this challenge even in one product medium sized companies when the building space occupied spans floors. Amazingly enough even a single staircase or elevator ride has the equivalent impact as a freeway commute. So the idea of working across geographies is far more common than people think.
Overall the big challenge in geography is communication. There just can’t be enough of it at the right bandwidth at the right time. I love all the tools we have. Those work miracles. As many comments from personal experience have talked about on the HN thread, they don’t quite replace what is needed. This post isn’t about that debate—I’m optimistic that these tools will continue to improve dramatically. One shouldn’t under estimate the impact of time zones as well. Even just coast to coast in the US can dramatically alter things.
The core challenge with remote work is not how it is defined right here and now. In fact that is often very easy. It usually only takes a single in person meeting to define how things should be split up. Then the collaboration tools can help to nurture the work and project. It is often the case that this work is very successful for the initial run of the project. The challenge is not the short term, but what happens next.
This makes geography a bit more of a big company thing (where often there are resources to work on multiple products or to fund multiple locations for work). The startup or single product small company has elements of each of these of course.
It is worth considering typical ways of dividing up the work:
- Alignment by date. The most brute force way of dividing work is that each set of remote people work on different schedules. We all know that once people have different delivery dates it becomes highly likely that the need (or ability) to coordinate on a routine basis is reduced. This type of work can go on until there are surprises or there is a challenge in delivering something that turns out to be connected or the same and should have been on the same schedule to begin with.
- Alignment by API. One of the most common places that remote work can be divided is to say that locations communicate by APIs. This works up until the API either isn’t right or needs to be reworked. The challenge here is that as a product you’re betting that your API design is robust enough that groups can remotely work at their own pace or velocity. The core question is why would you want to constrain yourself in this way? The second question is how to balance resources on each side of the API. If one side is stretched for resources and the other side isn’t (or both sides are) then geography prevents you from load balancing. Once you start having people in one geography on each side of the API you end up breaking your own remote work algorithm and you need to figure out the way to get the equivalent of in-person communication.
- Alignment by architecture. While closely related to API, there is also a case where remote work is layered in the same way the architecture is. Again, this works well at the start of a project. Over time this tends to decay. As we all know, as projects progress the architecture will change and be refactored or just redone (especially at both early stages and later in life). If the geography is then wrong, figuring out how to properly architect the code while also overlaying geography and thus skillsets and code knowledge becomes extremely difficult. A very common approach to geography and architecture is the have the app in one geo and the service in another. This just forces a lot of dialog at the app/service seam which I think most people agree is also where much of the innovation and customer experience resides (as well as performance efforts).
- Alignment by code. Another way to align is at the lowest level which is basically at the code or module level (or language or tool). Basically geography defines who owns what code based on the modules that a given location creates or maintains. This has a great deal of appeal to programmers. It also is the approach that requires the highest bandwidth communication since modules communicate across non-public APIs and often are not architectural boundaries (the first cases). This again can work in the short term but probably collapses the most in short order. You can often see first signs of this failing when given files become exceedingly large or code is obviously in the wrong place, simply because of module ownership.
If I had to sum up all of these in one challenge, it is that however you find you can divide the work across geography at a point in time, it simply isn’t sustainable. The very model you use to keep work geographically efficient are globally sub-optimal for the evolution of your code. It is a constraint that creates unnecessary tradeoffs.
On big projects over time, what you really want is to create centers of excellence in a technology and those centers are also geographies. This always sounds very appealing (IBM created this notion in their Labs). As we all know, however, the definition of what technologies are used where is always changing. A great example would be to consider how your 2015 projects would work if you could tap into a center of excellence in machine learning, but quickly realize that machine learning is going to be the core of your new product? Do you disband the machine learning team? Does the machine learning team now work on every new product in the company? Does the company just move all new products to the machine learning team? How do you geo-scale that sort of effort? That’s why the time element is tricky. Ultimately a center of excellence is how you can brand a location and keep people broadly aware of the work going on. It is easier said than done though. The IME at Microsoft was such a project.
Many say that agility can address this. You simply rethink the boundaries and ownership at points in time. The challenge is in a constant shipping mode that you don’t have that luxury. Engineers are not fully fungible and certainly careers and human desire for ownership and sense of completion are not either. It is easy to imagine and hard to implement agility of work ownership over time.
This has been a post on what things are hard about remote work, at least based on my experience. Of course if you have no option (for whatever reason) then this post can help you look at what can be done over time to help with the challenges that will arise.
In a post last week,@davewiner described The Lost Art of Software Testing. I loved the post and the ideas about testing expressed (Dave focuses more on the specifics of scenario and user experience testing so this post will broaden the definition to include that and the full range of testing). Testing, in many forms, is an integral part of building products. Too often if the project is late or hurry up and learn or agile methods are employed, testing is one of those efforts where corners are cut. Test, to put it simply, is the conscience of a product. Testing systematically determines the state of a product. Testers are those entrusted with keeping everyone within 360º of a product totally honest about the state of the project.
Before you jump to twitter to tell correct the above, we all know that scheduling, agile, lean, or other methods in no way at all preclude or devalue testing. I am definitely not saying that is the case (and could argue the opposite I am sure). I am saying, however, that when you look at what is emphasized with a specific way of working, you are making inherent tradeoffs. If the goal is to get a product into market to start to learn because you know things will change, then it is almost certainly the case that you also have a different view of fit and finish, edge conditions, or completeness of a product. If you state in advance that you’re going to release every time interval and too aggressively pile on feature work, then you will have a different view of how testing fits into a crunched schedule. Testing is as much a part of the product cycle as design and engineering, and like those you can’t cut corners and expect the same results.
Too often some view testing as primarily a function of large projects, mature products, or big companies. One of the most critical hires a growing team can make is that first testing leader. That person will assume the role of a bridge between development and customer success, among many other roles. Of course when you have little existing code and a one-pizza sized dev team, testing has a different meaning. It might even be the case that the devs are building out a full test infrastructure while the code is being written, though that is exceedingly rare.
No one would argue against testing and certainly no one wants a product viewed as low quality (or one that has not been thoroughly tested as the above referenced post describes). Yet here we are in the second half century of software development and we still see products and services referred to as buggy. Note: Dave’s post inspired me, not any recent quality issues faced by other vendors.
Are today’s products actually more buggy than those of 10, 15, or 20 years ago? Absolutely not. Most every bit of software used today is on the whole vastly higher quality than anything built years ago. If vendors felt compelled, many could prove statistically (based on telemetry) that customers experience far more robust products than ever before. Products still do, rarely, crash (though the impact of that is mostly just a nuisance rather than a catastrophic data loss) and as a result the visibility seems much higher. It wasn’t too long ago that mainstream products would routinely (weekly if not daily) crash and work would be lost with the trade press anxiously awaiting the next updates to get rid of bugs. Yet products still have issues, some major, and all that should do is emphasize the role of testing. Certainly the more visible, critical, or fatal a quality issue might be the more we might notice it. If a social network has a bug in a feed or fails to upload a photo that might be vastly different from a tool that loses data you typed and created.
Today’s products and services benefit enormously from telemetry which informs the real world behavior of a product. Many thought the presence of this data would in a sense automate testing. As we often see with advances that some believe would reduce human labor, the challenges scale to require a new kind of labor or to understand and act on new kinds of information.
What is Testing?
Testing has many different meanings in a product making organization, but in this post we want to focus on testing as it relates to the *verification that a product does what it is intended to do and does so elegantly, efficiently, and correctly. *
Some might just distill testing down to something like “find all the bugs”. I love this because it introduces two important concepts to product development:
- Bug. A bug is simply any time a product does not behave the way someone thought it should. This goes way beyond crashes, data loss, and security problems. Quite literally, if a customer/user of your product experiences the unexpected then you have a bug and should record it in some database. This means by definition testing is not the only source of bugs, but certainly is the collection and management point for the list of all the bugs.
- Specification. In practice, deciding whether or not a bug is something that requires the product to change means you have a definition or of how a product should behave in a given context. When you decide the action to take on a bug that is done with a shared understanding across the team of what a product should be doing. While often viewed as “old school” or associated with a classic “waterfall” methodology, specifications are how the product team has some sense of “truth”. As a team scales this becomes increasingly important because many different people will judge whether something is a bug or not.
Testing is also relative to the product lifecycle as great testers understand one the cardinal rules of software engineering—change is the enemy of quality. Testers know that when you have a bug and you change the code you are introducing risk into a complex system. Their job is to understand the potential impact a change might have on the overall product and weigh that against the known/reported problem. Good testers do not just report on problems than need to be fixed, but also push back on changing too much at the wrong time because of potential impact. Historically, for every 10 changes made to a stable product, at least one will backfire and cause things to break somehow.
Taken together these concepts explain why testing is such a sophisticated and nuanced practice. It also explains why it requires a different perspective than that of the product manager or the developer.
Checks and Balances
The art and science of making things at any scale is a careful balance of specialized skills combined with checks and balances across those skills.
Testing serves as part of the checks and balances across specializations. They do this by making sure everyone is clear on what the goals are, what success looks like, how to measure that success, and how to repeat those measures as the project progresses. By definition, testing does not make the product. That puts them in the ideal position to be the conscience of the product. The only agenda testing has is to make sure what everyone signed up to do is actually happening and happening well. Testing is the source of truth for a product.
Some might say this is the product manager’s role or the dev/engineering manager’s role (or maybe design or ops). The challenge is that each of these roles has other accountabilities to the product and so are asked to be both the creator and judge of their own work. Just as product managers are able to drive the overall design and cohesiveness of a product (among other things) while engineering drives the architecture and performance (among other things), we don’t normally expect those roles to reverse and certainly not to be held by a single person.
One can see how this creates a balanced system of checks:
- Development writes the code. This is the ultimate truth of what a product does, but not necessarily what the team might want it to do. Development is protective of code and has one view of what to change, what are the difficult parts of code or what parts are easy. Development must balance adding and changing code across individual engineers who own different parts of the code and so on.
- Operations runs the live product/service. Working side by side with development (in a DevOps manner) there are the folks that scale a product up and out. This is also about writing the code and tools required to manage the service.
- Product management “designs” the product. I say design to be broader than Design (interaction, graphical, etc.) and to include the choice of features, target customers, and functional requirements.
- Product design defines how a product feels. Design determines the look and feel of a product, the interaction flows, and the techniques used to express features.
- And so on across many disciplines…
That also makes testing a big pain in the neck for some people. Testers want precision when it might not exist. Testers by their nature want to know things before they can be known. Testers by their nature prefer stability over change. Testers by their nature want things to be measurable even when they can’t be measured. Testers tend towards process or procedural thinking when others might tend towards non-linear thinking. We all know that engineers tilt towards wanting to distill things to 1’s and 0’s. To the uninitiated (or the less than amazing tester) testers can come across as even more binary than binary.
That said, all you need is testing to save you from yourself one time and you have a new best friend.
Why Do We (Still) Need Testing?
Software engineering is a unique engineering discipline. In fact for the whole history of the field different people have argued either that computer software is mostly a science of computing or that computing is a craft or artistic practice. We won’t settle this here. On the other hand, it is fair to say that at least two things are true. First, even art can have a technology component that requires an engineering like approach, for example making films or photography. Second, software is a critical part of society’s infrastructure and from electrical to mechanical to civil we require those disciplines to be engineers.
Software has a unique characteristic which is that it is actually the case that a single person can have an idea, write the code, and distribute it for use. Take that civil engineers! Good luck designing and building a bridge on your own. Because of this characteristic of software there is desire to scale to large projects this same way.
People who know about software bugs/defects know that there are two ways to reduce the appearance and cost of shipping bugs. First, don’t introduce them at all. Methodologies like extreme or buddy programming or code reviews are all about creating a coding environment that prevents bugs from ever being typed.
Yet those methods still yield bugs. So the other technique employed is to attempt to get engineering to test all the code they write and to move the bug finding efforts “upstream”. That is write some new code for the product and then write code that tests your code. This is what makes software creation seem most like other forms of engineering or product creation. The beauty of software is just how soft it is—complete redesigns are keystrokes away and only have a cost in brain power and time. This contrasts sharply with building roads, jets, bridges, or buildings. In those cases, mistakes are enormously costly and potentially very dangerous. Make a mistake on the load calculations of a building and you have to tear it down and start over (or just leave the building mostly empty like the Stata Center at MIT).
Therefore moving detection of mistakes earlier in the process is something all engineering works to do (though not always successfully). In all but software engineering, the standard of practice employs engineers dedicated to the oversight of other engineers. You can even see this in practice in the basics of building a home where you must enlist inspectors to oversee electrical or steel or drainage, even though the engineers presumably do all they can to avoid mistakes. On top of that there are basic codes that define minimal standards. Software lacks all of these as a formality.
Thus the importance of specialized testing in software projects is a pressing need that is often viewed as counter-cultural. Lacking the physical constraints as well, engineers tend to feel “gummed” up and constrained by what would be routine quality practices in other engineering. For example, no one builds as much as a kitchen cabinet without detailed drawings with measurements. Yet routinely we in software build products or features without specifications.
Because of this tension between acting like traditional engineers and working to maintain the velocity of a single inspired engineer, there’s a desire to coalesce testing into the role of the engineer which can potentially allow for more agility or moving bug finding more upstream. One of the biggest changes in the field of software has been the availability of data about product quality (telemetry) which can be used to inform a project team about the state of things, perhaps before the product is in broad use.
There’s some recent history in the desire to move testing and development together and that is the devops movement. Devops is about rolling in operational efforts closer to engineering to prevent the “toss it over the wall” approach used by earlier in the evolution of web services. I think this is both similar and different. Most of the devops movement focuses on the communication and collaboration between development and operations, rather than the coalescing of disciplines. It is hard to argue against more communication and certainly within my own experience, when it came time to begin planning, building, and operating services our view of Operations was that it was adding to a seat at the table of PM, dev, test, design, and more.
The real challenge is that testing is far more sophisticated than anything an engineer can do solo. The reason is that engineers are focused on adding new code and making sure the new code works the way they wrote it. That’s very different than focusing on all that new code in the context of all other new code, all the new hardware, and if relevant all the old code as well (compatibility). In other words, as a developer is writing new code the question is really if it is even possible for the developer to make progress on that code while thinking about all those other things. Progress will quickly grind to halt if one really tries to do all of that work well.
As an aside, the role of developers writing unit tests is well-established and quite successful. Historically the challenge is maintaining these over time at the same level of efficacy. In addition, going beyond unit testing to include automation, configuration, API, and more to areas that the individual developer lacks expertise proves out the challenge of trying to operate without dedicated testing.
An analogy I’ve often used is to compare software projects to movies (they share a lot of similarities). With movies you immediately think of actor, director, screenwriter and tools like cameras, lights, sound. Those are the engineer and product manager equivalents. Put a glass of iced tea in the hand of an actor and the sunset in the background and all of a sudden someone has to worry about the level of the tea, condensation, and ice cube volume along with the level of the sun and number of birds on the horizon. Now of course an actor knows how that looks and so does the director. Movies are complex—they are shot out of order, reshot, and from many angles. So movie sets employ people to keep an eye on all those things—property masters, continuity, and so on. While the idea of the actor or director or camera operator trying to remember the size of ice cubes is not difficult to understand intellectually, in practice those people have a ton of other things to worry about. In fact they have so much to worry about that there’s no way they can routinely remember all those details or keep the big issues of the film front and center. Those ice cubes are device compatibility. The count of birds represent compatibility with other features. The level of the sun represents something like alternative scripts or accessibility, for example. All these things are things that need to be considered across the whole production in a consistent and well-understood manner. There’s simply no way for each “actor” to do an adequate job on all of them.
Therefore like other forms of engineering, testing is not an optional thing just because one can imagine software being made by just pure coding. Testing is a natural outcome of a project of any sophistication, complexity, or evolution over time. When I do something like run Excel 3 from 1990 on Windows 8, I think there’s an engineering accomplishment but I really know that is the work of testers validating whole subsystems across a product.
When to Test
You can bring on test too early, whether a startup or an existing/large project. When you bring on testing before you have a firm grasp from product management of what an end state might look like, then there’s no role testing can play. Testing is a relative science. Testers validate a product relative to what it is supposed to do. If what it is supposed to do is either unknown or to be determined then the last thing you want is someone saying it isn’t doing something right. That’s a recipe for frustrating everyone. Development is told they are doing the wrong thing. Product will just claim the truth to be different. And thus the tension across the team described by Dave in his post will surface.
In fact a classic era in Microsoft’s history with testing and engineering is based on wanting to find bugs upstream so badly that the leaders at the time drove folks to test far too early and eagerly. What resulted was no less than a tsunami of bugs that overwhelmed development and the project ground to a halt. Valuable lessons were passed on about starting too early—when nothing yet works there’s no need to start testing.
While there is a desire to move testing more upstream, one must also balance this with having enough of the product done and enough knowledge of what the product should be before testing starts. Once you know that then you can’t cut corners and you have to give the testing discipline time to do their job with a product that is relatively stable.
That condition—having the product in a stable state—before starting testing is a source of tension. To many it feels like a serialization that should not be done. The way teams I’ve worked on have always talked about this is that final stages of any project are the least efficient times for the team. Essentially the whole team is working to validate code rather than change code. Velocity of product development seems to stand still. Yet that is when progress is being made because testing is gaining assurance that the product does what it is supposed to do, well.
The tools of testing that span from unit tests, API tests, security tests, ad hoc testing, code coverage, UX automation, compatibility testing, and automation across all of those are the way they do their job. So much of the early stages of a project can be spent creating and managing that infrastructure when that does not depend on the specifics of how the product will work. Grant George, the most amazing test leader I ever had the opportunity to work with on both Windows and Office, used to call this the “factory floor”. He likened this phase to building the machinery required for a manufacturing line which would allow the team to rapidly iterate on daily builds while covering the full scope of testing the product.
While you can test too early you can also test too late. Modern engineering is not a serial process. Testers are communicating with design and product management (just like a devops process would describe) all along, for example. If you really do wait to test until the product is done, you will definitely run out of time and/or patience. One way to think of this is that testers will find things to fix—a lot of things—and you just need time to fix them.
In today’s modern era, testing doesn’t end when the product releases. The inbound telemetry from the real world is always there informing the whole team of the quality of the product.
One of the most magical times I ever experienced was the introduction of telemetry to the product development process. It was recently the anniversary of that very innovation (called “Watson”) and Kirk Glerum, one of the original inventors back in the late 1990’s, noted so on Facebook. I just wanted to share this story a little bit because of how it showed a counter-intuitive notion of how testing evolved. (See this Facebook post from Kirk). This is not meant to be a complete history.
While working what became Office 2000 in 1998 or so, Kirk had the brilliant insight that when a program crashed one could use the *internet* and get a snapshot of some key diagnostics and upload those to Microsoft for debugging. Previously we literally had either no data or someone would call telephone support and fax in some random hex numbers being displayed on a screen. Threading the needle with our legal department, folks like Eric LeVine worked hard to provide all the right anonymization, opt-in, and disclosure required. So rather than have a sample of crashes run on specific or known machines, Kirk’s insight allowed Microsoft to learn about literally all the crashes happening. Very quickly Windows and Office began working together and Windows XP and Office 2000 released as the first products with this enabled.
A defining moment was when a well-known app from a third party released a patch. A lot of people were notified by some automated method and downloaded the patch and installed it. Except the patch caused a crash in Word. We immediately saw a huge spike in crashes all happening in the same place and quickly figured out what was going on and got in touch with the ISV. The ISV was totally unaware of the potential problem and thus began an industry wide push on this kind of telemetry and using this aspect of the Windows platform. More importantly a fix was quickly released.
An early reaction was that this type of telemetry would obsolete much of testing. We could simply have enough people running the product to find the parts that crashed or were slow (later advances in telemetry). Of course most bugs aren’t that bad but even assuming they were this automation of testing was a real thought.
But instead what happened was testing quickly became the best users of this telemetry data. They were using it while analyzing the code base, understanding where the code was most fragile, and thinking ways to gather more information. The same could be said for development. Believe it or not, some were concerned that development would get sloppy and introduce bugs more often knowing that if a bug was bad enough it would pop up on the telemetry reports. Instead of course development became obsessed with the telemetry and it became a routine part of their process as well.
The result was just better and higher quality software. As our industry proves time and time again, the improvements in tools allow the humans to focus on higher level work and to gain an even better understanding of the complexity that exists. Thus telemetry has become an integral part of testing much the same way that improvements in languages help developers or better UX toolkits help design.
It Takes a Village
Dave’s post on testing motivated me to write this. I’ve written posts about the role of design, product management, general management and more over the years as well. As “software eats the world” and as software continues to define the critical infrastructure of society, we’re going to need more and more specialized skills. This is a natural course of engineering.
When you think of all the specialties to build a house, it should not be surprising that software projects will need increasing specialization. We will need not just front end or back end developers, project managers, designers, and so on. We will continue to focus on having security, operations, linguistics, accessibility, and more. As software matures these will not be ephemeral specializations but disciplines all by themselves.
Tools will continue to evolve and that will enable individuals to do more and more. Ten years ago to build a web service your startup required people will skills to acquire and deploy servers, storage networks, and routers. Today, you can use AWS from a laptop. But now your product has a service API and integration with a dozen other services and one person can’t continuously integrate, test, and validate all of those all while still moving the product forward.
Our profession keeps moving up the stack, but the complexity only increases and the demands from customers for a always improving experience continues unabated.
PS: My all time favorite book on engineering and one that shaped a lot of my own views is To Engineer Is Human by Henry Petroski. It talks about famous engineering “failures” and how engineering is all about iteration and learning. To anyone that ever released a bug, this should make sense (hint, that’s every one of us).
We are trying to change a culture of compartmentalized, start-from-scratch style development here. I’m curious if there are any good examples of Enterprise “Open Source” that we can learn from.
—Question from reader with a strong history in engineering management
When starting a new product line or dealing with multiple existing products, there’s always a question about how to share code. Even the most ardent open source developers know the challenges of sharing code—it is easy to pick up a library of “done” code, not so hard to share something that you can snapshot, but remarkably difficult to share code that is also moving at a high velocity like your work.
Developers love to talk about sharing code probably much more than they love to share code in practice. Yet, sharing code happens all the time—everyone uses an OS, web server, programming languages, and more that are all shared code. Where it gets tricky is when the shared code is an integral part of the product you’re developing. That’s when shared code goes from “fastest way to get moving” to “a potential (difficult) constraint” or to “likely a critical path”. Ironically, this is usually more true inside of a single company where one team needs to “depend” on another team for shared code than it is on developers sharing code from outside the company.
Organizationally, sharing code takes on varying degrees of difficulty depending on the “org distance” between developers. For example, two developers working for the same manager don’t even think about “sharing code” as much as they think about “working together”. At the other end of the spectrum, developers on different products with different code bases (perhaps started at different times with early thoughts that the products were unrelated or maybe one code base was acquired) think naturally about shipping their code base and working on their product first and foremost.
This latter case is often viewed as an organizational silo—a team of engineering, testing, product, operations, design, and perhaps even separate marketing or P&L responsibility. This might be the preferred org design (focus on business agility) or it might be because of intrinsic org structures (like geography, history, leadership approach). The larger these types of organizations the more the “needs of the org” tend to trump the “needs of the code”.
Let’s assume everyone is well-meaning and would share code, but it just isn’t happening organically. What are 5 things the team overall can do?
Ship together. The most straight-forward attribute two teams can modify in order to effectively share code is to have a release/ship schedule that is aligned. Sharing code is the most difficult when one team is locked down and the other team is just getting started. Things get progressively easier the closer to aligned each team becomes. Even on very short cycles of 30-60 days, the difference in mindset about what code can change and how can quickly grow to be a share-stopper. Even when creating a new product alongside an existing product, picking a scheduling milestone that is aligned can be remarkably helpful in encouraging sharing rather than a “new product silo” which only digs a future hole that will need to be filled.
Organize together to engineer together. If you’re looking at trying to share code across engineering organizations that have an org distance that involves general management, revenue or P&L, or different products, then there’s an opportunity to use organization approaches to share code. When one engineering manager can look at a shared code challenge across all of his/her responsibilities there more of a chance that an engineering leader will see this as an opportunity rather than a tax/burden. The dialog about efficacy or reality of sharing code does not span managers or importantly disciplines, and the resulting accountability rests within straight-forward engineering functions. This approach has limits (the graph theory of org size as well as the challenges of organizing substantially different products together).
Allocate resources for sharing. A large organization that has enough resources to duplicate code turns out to be the biggest barrier to sharing code. If there’s a desire to share code, especially if this means re-architecting something that works (to replace it with some shared code, presumably with a mutual benefit) then the larger team has a built-in mechanism to avoid the shared code tax. As painful as it sounds, the most straight-forward approach to addressing this challenge is to allocate resources such that a team doesn’t really have the option to just duplicate code. This approach often works best when combined with organizing together, since one engineering manager can simply load balance the projects more effectively. But even across silos, careful attention (and transparency) to how engineering resources are spent will often make this approach attainable.
Establish provider/consumer relationships. Often shared code can look like a “shared code library” that needs to be developed. It is quite common and can be quite effective to form a separate team, a provider, that exists entirely to provide code to other parts of the company, a consumer. The consumer team will tend to look at the provider team as an extension to their team and all can work well. On the other hand, there are almost always multiple consumers (otherwise the code isn’t really shared) and then the challenges of which team to serve and when (and where requirements might come from) all surface. Groups dedicated to being the producers of shared code can work, but they can quickly take on the characteristics of yet another silo in the company. Resource allocation and schedules are often quite challenging with a priori shared code groups.
Avoid the technical buzz-saw. Developers given a goal to share code and a desire to avoid doing so will often resort to a drawn-out analysis phase of the code and/or team. This will be thoughtful and high-integrity. But one person’s approach to being thorough can also look to another as a delay or avoidance tactic. No matter how genuine the analysis might be, the reality is that it can come across as a technical buzz-saw making all but the most idealized code sharing impossible. My own experience has been that simply avoiding this process is best—a bake-off or ongoing suitability-to-task discussion will only drive a wedge between teams. At some level sharing code is a leap of faith that a lot of folks need to take and when it works everyone is happy and if it doesn’t there’s a good chance someone is likely to say “told you so”. Most every bet one makes in engineering has skeptics. Spending some effort to hear out the skeptics is critical. A winners/losers process is almost always a negative for all involved.
The common thread about all of these is that they all seem impossible at first. As with any initiative, there’s a non-zero cost to obtaining goals that require behavior change. If sharing code is important and not happening, there’s a good chance you’re working against some of the existing constraints in the approach. Smart and empowered teams act with the best intentions to balance a seemingly endless set of inbound issues and constraints, and shared code might just be one of those things that doesn’t make the cut.
Keeping in mind that at any given time an engineering organization is probably overloaded and at capacity just getting stuff done, there’s not a lot of room to just overlay new goals.
Sharing code is like sharing any other aspect of a larger team—from best practices in tools, engineering approaches, team management—things don’t happen organically unless there’s a uniform benefit across teams. The role of management is to put in place the right constraints that benefit the overall goals without compromising other goals. This effort requires ongoing monitoring and feedback to make sure the right balance is achieved.
For those interested in some history, this is a Harvard Business School case on the very early Office (paid article) team and the challenges/questions around organizing around a set of related products (hint, this only seems relatively straight-forward in hindsight).
Targeting multiple operating systems has been an industry goal or non-goal depending on your perspective since some of the earliest days of computing. For both app developers and platform builders, the evolution of their work follow typical patterns—patterns where their goals might be aligned or manageable in the short term but become increasingly divergent over time.
While history does not always repeat itself, the ingredients for a repeat of cross-platform woes currently exist in the domain of mobile apps (mobile means apps developed for modern sealed-case platforms such as iOS, Android, Windows RT, Windows Phone, Blackberry, etc.) The network effects of platforms and the “winner take all” state that many believe is reached (or perhaps desirable) influences the behavior and outcome of cross-platform app development as well as platform development.
Today app developers generally write apps targeting several of the mobile platforms. If you look at number of “sockets” over the past couple of years there was an early dominance of iOS followed by a large growth of Android. Several other platforms currently compete for the next round of attention. Based on apps in respective app stores these are two leaders for the new platforms. App developers today seeking the most number of client sockets target at least iOS and Android, often simultaneously. It is too early to pick a winner.
Some would say that the role of the cloud services or the browser make app development less about the “client” socket. The data, however, suggests that customers prefer the interaction approach and integration capability of apps and certainly platform builders touting the size of app stores further evidences that perspective. Even the smallest amount of “dependency” (for customers or technical reasons) on the client’s unique capabilities can provide benefits or dramatically improve the quality of the overall experience.
In discussions with entrepreneurs I have had, it is clear the approach to cross-platform is shifting from “obviously we will do multiple platforms” to thinking about which platform comes first, second, or third and how many to do. Chris Dixon recently had some thoughts about this in the context of modern app development in general (tablets and/or phones). I would agree that tablets drive a different type of app over time simply because the scenarios can be quite different even with identically capable devices under the hood. The cross-platform question only gets more difficult if apps take on unique capabilities or user experiences for different sized screens, which is almost certainly the case.
The history of cross-platform development is fairly well-known by app developers.
The goal of an app developer is to acquire as many customers as possible or to have the highest quality engagement with a specific set of customers. In an environment where customers are all using one platform (by platform we mean set of APIs, tools, languages that are used to build an app) the choice for a developer is simple, which is to target the platform APIs in a fully exploitive manner.
The goal of being the “best” app for the platform is a goal shared by both app and platform developers. The reason for this is that nearly any app will have app competitors and one approach to differentiation will be to be the app that is best on the platform—at the very least this will garner the attention of the platform builder and result in amplification of the marketing and outreach of a given app (for example, given n different banking apps, the one that is used in demonstrations or platform evangelism will be the one that touts the platform advantages).
Once developers are faced with two or more platforms to target, the discussion typically starts with attempting to measure the size of the customer base for each platform (hence the debate today about whether market share or revenue define a more successful platform). New apps (at startups or established companies) will start with a dialog that depending on time or resources jumps through incredible hoops to attempt to model the platform dynamics. Questions such as which customers use which platforms, velocity of platform adoption, installed base, likelihood of reaching different customers on platforms, geography of usage, and pretty much every variable imaginable. The goal is to attempt to define the market impact of either support multiple platforms or betting on one platform. Of course none of these can be known. Observer bias is inherent in the process only because this is all about forecasting a dynamic system based on the behavior of people. But basing a product plan on a rapidly evolving and hard to define “market share” metric is fraught with problems.
During this market sizing debate, the development team is also looking at how challenging cross platform support can be. While mostly objective, just as with the market sizing studies, bias can sneak in. For example, if the developers’ skills align with one platform or a platform makes certain architectural assumptions that are viewed favorably then different approaches to platform choices become easy or difficult.
Developers that are fluent in HTML might suggest that things be done in a browser or use a mobile browser solution. Even the business might like this approach because it leverages an existing effort or business model (serving ads for example). Some view the choices Facebook made for their mobile apps as being influenced by these variables.
As the dialog continues, developers will tend to start to see the inherent engineering costs in trying to do a great job across multiple platforms. They will start to think about how hard it is to keep code bases in sync or where features will be easier/harder or appear first or even just sim-shipping across platforms. Very quickly developers will generally start to feel pulled in an impossible direction by having to be across multiple platforms and that it is just not viable to have a long-term cross-platform strategy.
The business view will generally continue to drive a view that the more sockets there are the better. Some apps are inherently going to drive the desire or need for cross-platform support. Anything that is about communications for example will generally argue for “going where the people are” or “our users don’t know the OS of their connected endpoints” and thus push for supporting multiple platforms. Apps that are offered as free front ends for services (online banking, buying tickets, or signing up for yoga class) will also feel pressures to be where the customers are and to be device agnostic. As you keep going through scenarios the business folks will become convinced that the only viable approach is to be on all the popular platforms.
That puts everyone in a very tense situation—everyone is feeling stressed about achieving success. Something has to give though.
We’ve all been there.
The industry has seen this cross-platform movie several times. It might not always be the same and each generation brings with it new challenges, new technologies, and potentially different approaches that could lead to different outcomes. Knowing the past is important.
Today’s cross-platform challenge can be viewed differently primarily because of a few factors when looking at the challenge from an app developer / ISV:
- App Services. Much of the functionality for today’s apps resides on software as a service infrastructure. The apps themselves might be viewed as fairly lightweight front ends to these services, at least for some class of apps or some approaches to app building. This is especially true today as most apps are still fairly “first generation”.
- Languages and tools. Today’s platforms are more self-contained in that the languages and tools are also part of the platform. In previous generations there were languages that could be used across different platforms (COBOL, C, C++) and standards for those languages even if there were platform-specific language extensions. While there are ample opportunities for shared libraries of “engine” code in many of today’s platforms, most modern platforms are designed around a heavy tilt in favor of one language, and those are different across platforms. Given the first point, it is fair to say that the bulk of the code (at least initially) on the device will be platform specific anyway.
- Integration. Much of what goes on in apps today is about integration with the platform. Integration has been increasing in each generation of platform shifts. For example, in the earliest days there was no cross-application sharing, then came the basics through files, then came clipboard sharing. Today sharing is implicit in nearly every app in some way.
Even allowing for this new context, there is a cycle at work in how multiple, competing platforms evolve.
This is a cycle so you need to start somewhere.
Initially there is a critical mass around one platform. As far as modern platforms go when iOS was introduced it was (and remains) unique in platform and device attributes so mobile apps had one place to go and all the new activity was on that platform. This is a typical first-mover scenario in a new market.
Over time new platforms emerge (with slightly different characteristics) creating a period of time where cross-platform work is the norm. This period is supported by the fact that platforms are relatively new and are each building out the base infrastructure which tends to look similar across the new platforms.
There are solid technical reasons why cross-platform development seems to work in the early days of platform proliferation. When new platforms begin to emerge they are often taking similar approaches to “reinventing” what it means to be a platform. For example, when GUI interfaces first came about the basics of controls, menus, and windows were close enough that knowledge of one platform readily translated to other platforms. It was technically not too difficult to create mapping layers that allowed the same code to be used to target different platforms.
During this phase of platform evolution the platforms are all relatively immature compared to each other. Each is focused on putting in place the plumbing that approaches platform design in this new shared view. In essence the emerging platforms tend to look more similar that different. The early days of web browsers–which many believed were themselves platforms–followed this pattern. There was a degree of HTML that was readily shared and consistent across platforms. At least this was the case for a while.
During this time there is often a lot of re-learning that takes place. The problems solved in the previous generation of platforms become new again. New solutions to old problems arise, sometimes frustrating developers. But this “new growth” also brings with it a chance to revisit old assumptions and innovate in new ways, even if the problem space is the same.
Even with this early commonality, things can be a real challenge. For example, there is a real desire for applications to look and feel “native”. Sometimes this is about placement of functionality such as where settings are located. It could be about the look or style of graphical elements or the way visual aspects of the platform are reflected in your app. It could also be about highly marketed features and how well your app integrates as evidence for supporting the platform.
Along the way things begin to change and the platforms diverge because of two factors. First, once the plumbing common to multiple early platforms is in place, platform builders begin to express their unique point of view of how platform services experiences should evolve. For example, Android is likely to focus on unique services and how the client interacts with and uses those services. To most, iOS has shown substantially more innovation in client-side innovation and first-party experiences. The resulting APIs exposed to developers start to diverge in capabilities and new API surface areas no longer seem so common between platforms.
Second, competition begins to drive how innovation progresses. While the first mover might have one point of view, the second (or third) mover might take the same idea of a service or API but evolve it slightly differently. It might integrate with backends differently or it might have a very different architecture. The role of voice input/reco, maps, or cloud storage are examples of APIs that are appearing on platforms but the expression of those APIs and capabilities they support are evolving in different ways that there are no obvious mappings between them.
As the platforms diverge developers start to make choices about what APIs to support on each platform or even which platforms to target. With these choices come a few well known challenges.
- Tools and languages. Initially the tools and languages might be different but things seem manageable. In particular, developers look to put as much code in common languages (“platform agnostic code”) or implement code as a web service (independent of the client device). This is a great strategy but does not allow for the reality that a good deal of code (and differentiation) will serve as platform-specific user experience or integration functionality. Early on tools are relatively immature and maybe even rudimentary which makes it easier to build infrastructure around managing a cross-platform project. Over time the tools themselves will become more sophisticated and diverge in capabilities. New IDEs or tools will be required for the platforms in order to be competitive and developers will gravitate to one toolset, resulting in developers themselves less able to context switch between platforms. At the very least, managing two diverging code bases using different tools becomes highly challenging–even if right now some devs think they have a handle on the situation.
- User interaction and design (assets). Early in platform evolution the basics of human interaction tend to be common and the approaches to digital assets can be fairly straight forward. As device capabilities diverge (DPI, sensors, screen sizes) the ability for the user interaction to be common also diverges. What works on one platform doesn’t seem right on another. Tablet sized screens introduce a whole other level of divergence to this challenge. Alternate input mechanisms can really divide platform elements (voice, vision, sensors, touch metaphors).
- Platform integration. Integrating with a platform early on is usually fairly “trivial”. Perhaps there are a few places you put preferences or settings, or connect to some platform services such as internationalization or accessibility. As platforms evolve, where and how to integrate poses challenges for app developers. Notifications, settings, printing, storage, and more are all places where finding what is “common” between platforms will become increasingly difficult to impossible. The platform services for identity, payment, and even integration with third party services will become increasingly part of the platform API as well. When those APIs are used other benefits will accrue to developers and/or end-users of apps—and these APIs will be substantially different across platforms.
- More code in the service. The new platforms definitely emphasize code in services to provide a way to be insulated from platform changes. This is a perfect way to implement as much of your own domain as you can. Keep in mind that the platforms themselves are evolving and growing and so you can expect services provided by the platform to be part of the core app API as well. Storage is a great example of this challenge. You might choose to implement storage on your own to avoid a platform dependency. Such an approach puts you in the storage business though and probably not very competitively from a feature or cost perspective. Using a third party API can pose the same challenge as any cross-platform library. At the same time, the platforms evolve and likely will implement storage APIs and those APIs will be rich and integrate with other services as well.
- Cross-platform libraries. One of the most common approaches developers attempt (and often provided by third parties as well) is to develop or use a library that abstracts away platform differences or claims to map a unique “meta API” to multiple platforms. These cross—platform libraries are conceptually attractive but practically unworkable over time. Again, early on this can work. Over time the platform divergence is real. There’s nothing you can do to make services that don’t exist on one platform magically appear on another or APIs that are architecturally very different morph into similar architectures. Worse, as an app developer you end up relying on essentially a “shadow” OS provided by a team that has a fraction of the resources for updates, tooling, documentation, etc. even if this team is your own dev team. As a counter example, games commonly use engines across platforms, but they rely on a very narrow set of platform APIs and little integration. Nevertheless, there are those that believe this can be a path (as it is for games). It is important to keep in mind that the platforms are evolving rapidly and the customer desire for well-integrated apps (not just apps that run).
- Multiple teams. Absent the ability to share app client code (because of differing languages), keeping multiple teams in sync on the same app is extremely challenging. Equally challenging is having one team time slice – not only is that mentally inefficient, maintaining up to date skills and knowledge for multiple platforms is challenging. Even beyond the basics of keeping the feature set the same, there are problems to overcome. One example is just timing of releases. It might be hard enough to keep features in sync and sim ship, but imagine that the demand spike for a new release of your app when the platform changes (and maybe even requires a change to your app). You are then in a position to need a release for one platform. But if you are halfway done with new features for your app you have a very tricky disclosure and/or code management challenge. These challenges are compounded non-linearly as the number of platforms increases.
These are a few potential challenges. Not every app will run into these and some might not even be real challenges for a particularly app domain. By and large, these are the sorts of things that have dogged developers working cross-platform approaches across clients, servers, and more over many generations.
The obvious question will continue to be debated, which is if there is a single platform winner or not. Will developers be able to pick a platform and reach their own business and product goals by focusing on one platform as a way of avoiding the issues associated with supporting multiple platforms?
The only thing we know for sure is that the APIs, tools, and approaches of different platforms will continue to evolve and diverge. Working across platforms will only get more difficult, not easier.
The new platforms moved “up the stack” and make it more difficult for developers to isolate themselves from the platform changes. In the old days, developers could re-implement parts of the platform within the app and use that across platforms or even multiple apps. Developers could hook the system and customize the behavior as they saw fit. The more sealed nature of platforms (which delivers amazing benefits for end-users) makes it harder for developers to create their own experience and transport it across platforms. This isn’t new. In the DOS era, apps implemented their own printing subsystems and character-based user models all of which got replaced by GUI APIs all to the advantage of developers willing to embrace the richer platforms and to the advantage of customers that gained a new level of capabilities across apps.
The role of app stores and approval processes, the instant ability for the community to review apps, and the need to break through in the store will continue to drive the need to be great apps on the chosen platforms.
Some will undoubtedly call for standards or some homogonization of platforms. Posix in the OS world, Motif in the GUI world, or even HTML for browsers have all been attempts at this. It is a reasonable goal given we all want our software investments to be on as many devices as possible (and this desire is nothing new). But is it reasonable to expect vendors to pour billions into R&D to support an intentional strategy of commoditization or support for a committee design? Vendors believe we’re just getting started in delivering innovation and so slowing things down this way seems counter-intuitive at best.
Ultimately, the best apps are going to break through and I think the best apps are going to be the ones that work with the platform not in spite of it and the best apps won’t duplicate code in the platform but work with platform.
It means there some interesting choices ahead for many players in these ecosystems.
# # # # #
In a previous post, the topic of surviving legacy code was discussed. Browsers (or rendering engines within browsers) represent an interesting case of mission critical code as described in the post. A few folks noticed yesterday that Google has started a new rendering engine based on the WebKit project (“This was not an easy decision.” according to the post)
Relative to moving legacy code forward this raises some interesting product development challenges. This blog focuses on product development and the tradeoffs that invariably arise, and definitely not about being critical or analyzing choices made by others, as there are many other places to gain those perspectives. It is worth looking at actions through the lens of the product development discipline.
In this specific case there is an existing code base, legacy code, and a desire to move the code base forward. Expressed in the announcement, however briefly, is the architectural challenge faced by maintaining the multi-process architecture. Relative to the taxonomy from the previous post, this is a clear case of the challenges of moving an architecture forward. The challenge is pretty cut and dry.
The approach taken is one that looks very much a break in the evolution of the code base, a “fork” as described some. Also at work are efforts after forking to delete unused code, which is another technique for managing legacy code described previously. These are perfectly reasonable ways to move a code base forward, but also come with some challenges worth discussing.
What the fork?
(OK, I couldn’t resist that, or the title of this post).
Forking a code base is not just something one can do in the open source world, though there is somewhat of a special meaning there. It is a general practice applicable to any code base. In fact, robust source code control systems are deliberate in supporting forks because that is how one experiments on a code base, evolves it asynchronously, or just maintains distinct versions of the code.
A fork can be a temporary state, or sometimes called a branch when there are several and the intent to be temporary is clear. This is what one does to experiment on an alternate implementation or experiment on a new feature. After the experiment the changes are merged back in (or not) and the branch is closed off. Evolution of the code base moves forward as a singular effort.
A fork can also be permanent. This is where one can either reap significant benefits or introduce significant challenges, or both, in evolving the code. One can imagine forks that look like one of these two:
In the first case, the two paths stay in parallel. That’s an interesting approach. It is essentially saying that the code will do the same thing, but differently. In code one would use this approach if you wanted to maintain two variations of the same product but have different teams working on them. The differences between the two forks are known and planned. There’s a routine process for sharing changes as each of the branches evolve. In many ways, one could view the current state of webkit as this state since at no point is there a definitive version in use by every party. You might just call this type of fork a parallel evolution.
In the second case, the two paths diverge and diverge more over time. This too is an interesting approach. This type of fork is a one-time operation and then the evolution of each of the branches proceeds at the discretion of each development team. This approach says that the goals are no longer aligned and different paths need to be followed. There’s no limitation to sharing or merging changes, but this would happen opportunistically, not systematically. Comments from both resulting efforts of the WebKit fork reinforce the loosely coupled nature of the fork, including deleting the code unused by the respective forks along with a commitment to stay in communication.
For any given project, both of these could be appropriate. In terms of managing legacy code, both are making the statement that the existing code is no longer on the right evolutionary path—whether this is a technical, business, or engineering challenge.
Forking is a revolutionary change to a code base. It is sort of the punctuation in a punctuated equilibrium. It is an admission that the path the code and team were on is no longer working.
The most critical choice to make when forking code is to have an understanding of where the functionality goes. In the taxonomy of managing legacy code, a fork is a reboot, not a recast.
From a legacy code perspective, the choice to fork is the same as a choice to rewrite. Forking is just an expedient way to get started. Rather than start from an empty source tree, one can visualize the fork as a tree copy of all the existing code to a new project and a fast start. This isn’t cheating. It can be a big asset or a big liability.
As an asset, if you start from all the same existing code then the chances of being compatible in terms of features, performance, and quality are pretty high. Early in the project your code base looks a lot like the one you started from. The differences are the ones you immediately introduce—deleting code you don’t think you need, rewriting some parts critical to you, refactoring/restructuring for better engineering. All of these are software changes and that means, definitionally, there will be regressions relative to the starting point in the neighborhood of 10%.
On the other hand, a fork done this way can also introduce a liability. If you start from the same code you were just using, then you bring with it all the architecture and features that you had before of course. The question becomes what were you going away from? What was it that could not be worked into the code base the way it stood? The answers to these questions can provide insights into the balance between maintaining exact functionality out of the gate and how fast and well you can evolve towards your new goals down the road.
In both cases, the functionality of the other fork is not standing still (though on a project where your team controls both forks, you can decide resource levels or amount of change tolerated in one or the other fork). The functionality of the two code bases will necessarily diverge just because everything would need to be done twice and the same way, which will prove to be impossible. In the case of WebKit it is worth noting that it was derived from a fork of KHTML, which has since had a challenging path (see http://en.wikipedia.org/wiki/WebKit).
Point of view required
As said, the process of rebooting via any means is a perfectly viable way to move forward in the face of legacy code challenges. What makes it possible to understand a decision to fork is having (or communicating) a point of view as to why a fork (a reboot, rewrite) is the right approach. A point of view simply says what problem is being solved and why the approach solves the problem in a robust manner.
To arrive at such a conclusion, the team needs to have an open and honest dialog about the direction things need to go and the capabilities of the team and existing code to move forward. Not everyone will ever agree—engineers are notoriously polarizing, or some might say “religious”, at moments like this. Those that wrote the code are certain they know how to move it forward. Those that did not write the code cannot imagine how it could possibly move forward. All want ways to code with minimal distraction from their highest priorities. Open minds, experimentation, and sharing of data are the tools for the team to use to work (and work it is) to a shared approach for the fork to work.
If the team chooses a reboot the critical information to articulate is the point of view of “why”. In other words, what are assumptions about the existing code are no longer valid in some new direction or strategy. Just as critically are the new bets or new assumptions that will drive decision making.
This is not a story for the outside world, but is critical to the successful engineering of the code. You really need to know what is different—and that needs to map to very clear choices where one set of assumptions leads to one implementation and another set of assumptions leads to very different choices. Open source turns this engineering dialog into an externally visible dialog between engineers.
Every successful fork is one that has a very clear set of assumptions that are different from the original code base.
If you don’t have a different set of assumptions that are so clearly different to the developers doing the work, then the chances are you will just be forked and not really drive a distinct evolutionary path in terms of innovation.
Knowing this point of view – what are the pillars driving a change in code evolution – turns into the story that will get told when the next product releases. This story will not only need to explain what is new, but ultimately as a matter of engineering, will need to explain to all parties why some things don’t quite work the way they do with the other fork, past or present at time of launch.
If you don’t have this point of view when you start the project, you’re not going to be able to create one later in the project. The “narrative” of a project gets created at the start. Only marketing and spin can create a story different than the one that really took place.
In the software industry, legacy code is a phrase often used as a negative by engineers and pundits alike to describe the anchor around our collective necks that prevents software from moving forward in innovative ways. Perhaps the correlation between legacy and stagnation is not so obvious—consider that all code is legacy code as soon it is used by customers and clouds alike.
Legacy code is everywhere. Every bit of software we use, whether in an app on a phone, in the cloud, or installed on our PC is legacy code. Every bit of that code is being managed by a team of people who need to do something with it: improve it, maintain it, age it out. The process of evolving code over time is much more challenging than it appears on the face of it. Much like urban planning, it is easy to declare there should be mass transit, a new bridge, or a new exit, but figuring out how to design and engineer a solution free of disruptions or worse is extremely challenging. While one might think software is not concrete and steel, it has a structural integrity well beyond the obvious.
One of the more interesting aspects of Lean Startup for me is the notion of building products quickly and then reworking/pivoting/redoing them as you learn more from early adopters. This works extremely well for small code and customer bases. Once you have a larger code base or paying [sic] customers, there are limits to the ability to rewrite code or change your product, unless the number of new target customers greatly exceeds the number of existing customers. There exists a potential to slow or constrain innovation, or the reduced ability to serve as a platform for innovation. So while being free of any code certainly removes any engineering constraint, few projects are free of existing code for very long.
We tend to think of legacy code in the context of large commercial systems with support lifecycles and compatibility. In practice, lifting the hood of any software project in use by customers will have engineers talking about parts of the system that are a combination of mission critical and very hard to work near. Every project has code that might be deemed too hot to handle, or even radioactive. That’s legacy code.
This post looks at why code is legacy so quickly and some patterns. There’s no simple choice as to how to move forward but being deliberate and complete in how you do turns out to be the most helpful. Like so many things, this product development challenge is highly dependent on context and goals. Regardless, the topic of legacy is far more complex and nuanced than it might appear.
One person’s trash is another’s treasure
Whether legacy code is part of our rich heritage to be brought forward or part of historical anomalies to be erased from usage is often in the eye of the beholder. The newer or more broadly used some software is the more likely we are to see a representation of all views. The rapid pace of change across the marketplace, tools and techniques (computer science), and customer usage/needs only increases the velocity code moves to achieve legacy status.
In today’s environment, it is routine to talk about how business software is where the bulk of legacy code exists because businesses are slow to change. The inability to change quickly might not reflect a lack of desire, but merely prudence. A desire to improve upon existing investments rather than start over might be viewed as appropriately conservative as much as it might be stubborn and sticking to the past.
Business software systems are the heart and soul of what differentiates one company’s offering from another. These are the treasures of a company. Think about the difference between airlines or banks as you experience them. Different companies can have substantially different software experiences and yet all of them need to connect to enormously complex infrastructures. This infrastructure is a huge asset for the company and yet is also where changes need to happen. These systems were all created long before there was an idea of consumers directly accessing every aspect of the service. And yet with that access has come an increasing demand for even more features and more detailed access to the data and services we all know are there. We’re all quick to think of the software systems as trash when we can’t get the answer or service we want when we want it when we know it is in there somewhere.
Businesses also run systems that are essential but don’t necessarily differentiate one business from another or are just not customer facing. Running systems internally for a company to create and share information, communicate, or just run the “plumbing” of a company (accounting, payroll) are essential parts of what make a company a company. Defining, implementing, and maintaining these is exactly the same amount of work as the customer facing systems. These systems come with all the same burdens of security, operations, management, and more.
Only today, many of these seem to have off-the-shelf or cloud alternatives. Thus the choices made by a company to define the infrastructure of the company quickly become legacy when there appear to be so many alternatives entering the marketplace. To the company with a secure and manageable environment these systems are assets or even treasures. To the folks in a company “stuck” using something that seems more difficult or worse than something they can use on the web, these seem like crazy legacy systems, or maybe trash.
Companies, just as cities, need to adapt and change and move forward. There’s not an option to just keep running things as they are—you can’t grow or retain customers if your service doesn’t change but all the competitors around you do. So your treasure is also your legacy—everything that got you to where you are is also part of what needs to change.
Thinking about the systems consumers use quickly shows how much of the consumer world is burdened by existing software that fits this same mold—is the existing system trash or treasure? The answer is both and it just depends on who you ask or even how you ask.
Consumer systems today are primarily service-based. As such the pace of change is substantially different from the pace of change of the old packaged software world since changes only need take place at the service end without action by consumers. This rapid pace of change is almost always viewed as a positive, unless it isn’t.
The services we all use are amazing treasures once they become integral to our lives. Mail, social networking, entertaining, as well as our banking and travel tools are all treasures. They can make our lives easier and more fun. They are all amazing and complex software systems running at massive scale. To the companies that build and run these systems, they are the company treasures. They are the roads and infrastructure of a city.
If you want to start an uproar with a consumer service, then just change the user interface a bit. One day your customers (users, people) sign on and there’s a who moved my cheese moment. Unlike the packaged software world, no choice was made no time was set aside, rather just when you needed to check your mail, update status, or read some news everything is different. Generally the more acute your experience is the more wound up you get about the change. Unlike adding an extra button on an already crowded toolbar, a menu command at the end of a long menu, or just a new set of optional customizations, this in your face change is very rarely well-received.
Sometimes you don’t even need to change your service, but just say you’re going to shut it down and no longer offer it. Even if the service hasn’t changed in a long time or usage has not increased, all of a sudden that legacy system shows up as someone’s treasure. City planners trying to find new uses for a barely used public facility or rezone a parking lot often face incredible resistance from a small but stable customer population, even if the resources could be better used for a more people. That old abandoned building is declared an historic landmark, even if it goes unused. No matter how low the cost or how rich the provider, resources are finite.
The uproar that comes from changing consumer software represents customers clamoring for a maintaining the legacy. When faced with a change, it is not uncommon to see legacy viewed as a heritage and not the negatives usually associated with software legacy.
Often those most vocal about the topic have polarizing views on changes. Platforms might be fragmented and the desire is expressed to get everyone else to change their (browser, runtime, OS) to keep things modern and up to date—and this is expressed with extreme zest for change regardless of the cost to others. At the same time, things that impact a group of influentials or early adopters are most assailed when they do change in ways that run counter to convential wisdom.
Somewhere in this world where change and new are so highly valued and same represents old and legacy, is a real product development challenge. There are choices to be made in product development about the acceptance and tolerance of change, the need to change, and the ability to change. These are questions without obvious answers. While one person’s trash is another’s treasure makes sense in the abstract, what are we to do when it comes to moving systems forward.
Let’s assume it is impossible to really say whether code is legacy to be replaced or rewritten or legacy to be preserved and cherished. We should stipulate this because it doesn’t really matter for two reasons:
- Assuming we’re not going to just shut down the system, it will change. Some people will like the change and other’s will not. One person’s treasure is another’s trash.
- Software engineering is a young and evolving field. Low-level architecture, user interaction, core technologies, tools, techniques, and even tastes will change, and change dramatically. What was once a treasured way to implement something will eventually become obsolete or plain dumb.
These two points define the notion that all existing code is legacy code. The job of product development is to figure out which existing code is a treasure and which is trash.
It is worth having a decision framework for what constitutes trash for your project. Part of every planning process should include a deliberate notion of what code is being treated as trash and what code is a treasure. The bigger the system, the more important it is to make sure everyone is on the same page in this regard. Inconsistencies in how change is handled can lead to frustrated or confused customers down the road.
Written with different assumptions
When a system is created, it is created with a whole host of assumptions. In fact, a huge base of assumptions are not even chosen deliberately at the start of a project. From the programming language to the platform to the basic architecture are chosen rather quickly at the start of a project. It turns out these put the system on a trajectory that will consistently reinforce assumptions.
We’ve seen detailed write-ups of the iOS platform and the evolution of apps relative to screen attributes. On the one hand developers coding to iOS know the specifics of the platform and can “lock” that assumption—a treasure for everyone. Then characteristics of screens potentially change (ppi, aspect ratio, size) and the question becomes whether preserving the fixed point is “supporting legacy” or “holding back innovation”.
While that is a specific example, consider broader assumptions such as bandwidth, cpu v. gpu capability, or even memory. An historic example would be how for the first ten years of PC software there was an extreme focus on reducing the amount of memory or disk storage used by software. Y2K itself was often blamed on people trying to save a few bits in memory or on disk. Structures were packed. Overlays were used. Data stored in binary on disk.
Then one day 32-bits, virtual memory and fast gigabyte disks become normal. For a short time there was a debate about sloppy software development (“why use 32 bits to represent 0-255?”) but by and large software developers were making different assumptions about what was the right starting point. Teams went through code systematically widening words, removing complexity of the 16 bit address space, and so on.
These changes came with a cost—it took time and effort to update applications for a new screen or revisit code for bit-packing assumptions. These seem easy and right in hindsight—these happen to be transparent to end-users. But to a broad audience these changes were work and the assumptions built into the code so innocently just became legacy.
It is easy for us to visualize changes in hardware driving these altered assumptions. But assumptions in the software environment are just as pervasive. Concepts ranging from changes in interaction widgets (commands to toolbars to context sensitive) to metaphors (desktop or panels) or even assumptions about what is expected behavior (spell checking). The latter is interesting because the assumption of having a local dictionary improve over time and support local custom dictionaries was state of the art. Today the expectation is that a web service is the best way to know how to spell something. That’s because you can assume connectivity and assume a rich backend.
When you start a new project, you might even take a step back and try to list all of the assumptions you’re making. Are you assuming screen size or aspect ratio, keyboard or touch, unlimited bandwidth, background processing, single user, credit cards, left to right typing, or more. It is worth noting that in the current climate of cross-platform development, the assumptions made on target platforms can differ quite a bit—what is easy or cheap on one platform might be impossible or costly on another. So your assumptions might be inherited from a target platform. It is rather incredible the long list of things one might assume at the start of a project and each of those translates into a potential roadblock into evolving your system.
Evolved views of well-architected
Software engineering is one of the youngest engineering disciplines. The whole of the discipline is a generation, particularly if you consider the micro-processor based view of the field. As defined by platforms, the notion of what constitutes a well-architected system is something that changes over time. This type of legacy challenge is one that influences engineers in terms of how they think about a project—this is the sort of evolution that makes it easy or difficult to deliver new features, but might not be visible to those using the system.
As an example, the evolution of where code should be executed in a system parallels the evolution of software engineering. From thin-client mainframes to rich-client tightly-coupled client/server to service-oriented architecture we see very different views of the most fundamental choice about where to put code. From modular to structured to object-oriented programming and more we see fundamentally different choices about how to structure code. From a focus on power, cores, and compute cycles to graphics, mobility, and battery life we see dramatic changes in what it means to be modern and well-architected.
The underlying architecture of a system affords developers a (far too) easy way to declare something as legacy code to be reworked. We all know a system written in COBOL is legacy. We all know if a system is a stateful client application to install in order to use the system it needs to be replaced.
When and how to make these choices is much more complex. These systems are usually critical to the operations of a business and it is often entirely possible (or even easier) to continue to deliver functionality on the existing system rather than attempt to replace the system entirely.
One of the most eye-opening examples of this for me is the description of the software developed for the Space Shuttle, which is a long-term project with complexity beyond what can even be recreated, see Architecture of the space shuttle primary avionics software system. The state of the art in software had moved very far, but the risks or impossibility of a modern and current architecture outweighed the benefits. We love to say that not every project is the space shuttle, but if you’re building the accounts system for a bank, then that software is as critical to the bank as avionics are to the shuttle. Mission critical is not only an absolute (“lives at stake”) but also relative in terms of importance to the organization.
A very smart manager of mine once said “given a choice, developers will always choose to rewrite the code that is there to make it better”. What he meant was that taken from a pure engineering approach, developers would gladly rewrite a body of code in order to bring it up to modern levels. But the downside of this is multi-faceted. There’s an opportunity cost. There’s often an inability to clearly understand the full scope of the existing system. And of course, basic software engineering says that 10% of all code changes will yield regressions. Simply reworking code because the definition of well-architected changed might not always be prudent. The flip side of being modern is sometimes the creation of second system syndrome.
Changed notion of extensibility
All software systems with staying power have some notion of extensibility or a platform. While this could be as obvious as an API for system services, it could also be an add-in model, a wire protocol, or even file formats. Once your system introduces extensibility it becomes a platform. Someone, internal or external, will take advantage of your extensibility in ways you probably didn’t envision. You’ve got an instant legacy, but this legacy is now a dependency to external partners critical to your success.
In fact, your efforts at delivering goodness have quickly transformed someone else’s efforts. What was a feature to you can become a mission critical effort to your customer. This is almost always viewed as big win—who doesn’t want people depending on your software in this way. In fact, it was probably the goal to get people to bet their efforts on your extensibility. Success.
Until you want to change it. Then your attempts to move your platform forward are constrained by what put in place in the first version. And often your first version was truly a first version. All the understanding you had of what people wanted to do and what they would do are now informed by real experience. While you can do tons of early testing and pre-release work, a true platform takes a long time before it becomes clear where efforts at tapping extensibility will be focused.
During this time you might even find that the availability of one bit of extensibility caused customers to look at other parts of your system and invent their own extensibility or even exploit the extensibility you provided in ways you did not intend.
In fact whole industries can spring up based on pushing the limits of your extensibility: browser toolbars, social network games, startup programs.
Elements of your software system that are “undocumented implementation” get used by many for good uses. Reversed engineered file formats, wire protocols, or just hooking things at a low level all provide valuable functionality for data transfer, management, or even making systems accessible to users with special needs.
Taking it a step further, extensibility itself (documented or implied) becomes the surface area to exploit for those wishing to do evil things to your system or to use your system as a vector for evil.
What was once a beautiful and useful treasure can quickly turn into trash or worse. Of course if bad things are happening then you can seek to remove the surface area exposed by your system and even then you can be surprised at the backlash that comes. A really interesting example of this is back in 1999 when the “Melissa” virus exploited the automation in Outlook. The reaction was to disable the automation which broke a broad class of add-ins and ended up questioning the very notion of extensibility and automation in email. We’ve seen similar dynamics with viral gaming in social networks where the benefits are clear but once exploited the extensibility can quickly become a liability. Melissa was not a security hole at the time, but since then the notion of extensibility has been redefined and so systems with or utilizing such extensibility get viewed as legacy systems that need to be thought through.
While a system is being developed, there are scenarios and workflows that define the overall experience. Even with the best possible foresight, it is well-established that there is a high error rate in determining how a system will be used in the real world. Some of these errors are fairly gross but many are more nuanced, and depend on the context of usage. The more general purpose a system is the more likely it is to find the usage of a system to be substantially different from what it was designed to do. Conversely, the more task-oriented a system is the more likely it is to quickly see the mistakes or sub-optimal choices that got made.
Usage quickly gets to assumptions built into the system. List boxes designed to hold 100 names work well unless everyone has 1000 names in their lists. Systems designed for high latency networks behave differently when everyone has broadband. And while your web site might be great on a 15” laptop, one day you might find more people accessing it from a mobile browser with touch. These represent the rug being pulled out from under your usage assumptions. Your system implementation became legacy while people are just using it because they used it differently than you assumed.
At the same time, your views evolve on where you might want to take the system or experience. You might see new ways of input based on innovative technologies, new ways of organizing the functionality based on usage or increase in feature scope, or whole new features that change the flow of your system. These step-function changes are based on your role as designer of a system and evolving it to new usage scenarios.
Your view at the time when designing the changes is that you’re moving from the legacy system. Your customers think of the system as treasure. You view your change as the new treasure. Will your customers think of them as treasure or trash?
In these cases the legacy is visible and immediately runs into the risks of alienating those using your system. Changes will be dissected and debated among the core users (even for an internal system—ask the finance team how they like the new invoicing system, for example). Among breadth users the change will be just that, a change. Is the change a lot better or just a lot different? In your eyes or customer’s eyes? Are all customers the same?
We’re all familiar with the uproar that happens when user interface changes. Starting from the version upgrades of DOS classics like dBase or 1-2-3 through the most recent changes to web-based email search, or social networking, changing the user experience of existing systems to reflect new capabilities or usage is easily the most complex transformation existing, aka legacy, code must endure.
If you waded through the above examples of what might make existing code legacy code you might be wondering what in the world you can do? As you’ve come to expect from this blog, there’s no easy answer because the dynamics of product development are complex and the choices dependent upon more variables than you can “compute”. Product development is a system of linear equations with more variables than equations.
The most courageous efforts of software professionals involve moving systems forward. While starting with a clean slate is often viewed as brave and creative, the reality is that it takes a ton of bravery and creativity to decide how to evolve a system. Even the newest web service quickly becomes an enormous challenge to change—the combination of engineering complexities and potential for choosing “wrong” are enough to overwhelm any engineer. Anyone can just keep something running, but keeping something running while moving it to new and broader uses defines the excitement of product development.
Once you have a software system in place with customers/users, and you want to change some existing functionality there are a few options you can choose from.
- Remove code. Sometimes the legacy code can just be removed. The code represents functionality that should no longer be part of your system. Keeping in mind that almost no system has something totally unused, you’re going to run into speed bumps and resistance. While it is often easy to think of removing a feature, chances are there are architectural dependencies throughout a large system that depend on not just the feature but how it is implemented. Often the cost of keeping an implementation around is much lower than the perceived benefit from not having it. There’s an opportunity to make sure that the local desire to have fewer old lines of code to worry about is not trumping a global desire to maintain stability in the overall development process. On the other hand, there can be a high cost or impossibility to keeping the old code around. The code might not meet modern standards for privacy or security, even though it is not executed it exposes surface area that could be executed, for example.
- Run side by side. The most common refrain for any user-interface changes to existing code is to leave both implementations running and just allow a compatibility mode or switch to return to the old way of running. Because the view is that leaving around code is usually not so high cost it is often the case that those on the outside of a project view it as relatively low cost to leave old code paths around. As easy as this sounds, the old code path still has operational complexities (in the case of a service) and/or test matrix complexities that have real costs even if there is no runtime cost to those not accessing it (code not used doesn’t take up memory or drain power). The desire most web developers have to stop supporting older browsers is essentially this argument—keeping around the existing code is more trouble than it might be worth. Side by side is almost never a practical engineering alternative. From a customer point of view it seems attractive except inevitably the question becomes “how long can I keep running things the old way”. Something claimed to be a transition quickly turns into a permanent fixture. Sometimes that temporary ramp the urban planners put in becomes pretty popular. There’s a fun Harvard Business School case on the design of the Office Ribbon ($) that folks might enjoy since it tees up this very question.
- Rewrite underneath. When there are changes in architectural assumptions one approach is to just replumb the system. Developers love this approach. It is also enormously difficult. Implicit in taking this approach is that the rest of the system “above” will function properly in the face of a changed implementation underneath or that there is an obvious match from one generation of plumbing to another. While we all know good systems have abstractions and well-designed interfaces, these depend on characteristics of the underlying architecture. An example of this is what happens when you take advantage of a great architecture like file i/o and then change dramatically the characteristics of the system by using SSDs. While you want everything to just be faster, we know that the whole system depended on the latency and responsiveness of systems that operated an order of magnitude slower. It just isn’t as simple as rewriting—the changes will ripple throughout the system.
- Stage introduction. Given the complexities of both engineering and rolling out a change to customers, often a favored approach is the staged rollout. In this approach the changes are integrated over time through a series of more palatable changes. Perhaps there are architectural changes done first or perhaps some amount of existing functionality is maintained initially. Ironically, this brings us back to the implication that most businesses are the ones slow to change and have the most legacy. In fact, businesses most often employ the staged rollout of system changes. This seems to be the most practical. It doesn’t have the drama of a disruptive change or the apparent smoothness of a compatibility mode, and it does take longer.
Taking these as potential paths to manage transitions of existing code, one might get discouraged. It might even be that it seems like the only answer is to start over. When thinking through all the complexities of evolving a system, starting over, or rebooting, becomes appealing very quickly.
Dilemma of rebooting
Rebooting a system has a great appeal when faced with a complex system that is hard to manage, was architected for a different era, and is loaded with dated assumptions.
This is even more appealing when you consider that the disruption going on in the marketplace that is driving the need for a whole new approach is likely being led by a new competitor that has no existing customers or legacy. This challenge gets to the very heart of the innovator’s dilemma (or disruptive technologies). How can you respond when you’ve got a boat anchor of code?
Sometimes you can call this a treasure or an asset. Often you call them customers.
It is very easy to say you want to rewrite a system. The biggest challenge is in figuring out if you mean literally rewrite it or simply recast it. A rewrite implies that you will carry forth everything you previously had but somehow improved along the dimension driving the need to rework the system. This is impossibly hard. In fact it is almost impossible to name a total rewrite that worked without some major disruption, a big bet, and some sort of transition plan that was itself a major effort.
The dilemma in rewriting the system is the amount of work that goes into the transition. Most systems are not documented or characterized well-enough to even know if you have completely and satisfactorily rewritten it. The implications for releasing a system that you believe is functionally equivalent but turns out not to be are significant in terms if mismatched customer expectations. Even small parts of a system can be enormously complex to rewrite in the sense of bringing forward all existing functionality.
On the other hand, if you have a new product that recasts the old one, but along the lines of different assumptions or different characteristics then it is possible to set expectations correctly while you have time to complete the equivalent of a rewrite or while customers get used to what is missing. There are many challenges that come from implementing this approach as it is effectively a side-by-side implementation but for the entire product, not just part of the code.
Of course an alternative is just an entirely new product that is positioned to do different things well, even if it does some of the existing product. Again, this simply restates the innovator’s dilemma argument. The only difference is that you employ this for your own system.
The biggest frustration software folks have with the “build a new system that doesn’t quite do everything the old one did” is the immediate realization of what is missing. From mail clients to word processors to development tools and more, anything that comes along that is entirely new and modern is immediately compared to the status quo. This is enormously frustrating because of course as software people we are familiar with what is missing, just as we’re familiar with finite time and resources. It is even more interesting when the comparison is made to a competitor who only does new things in a modern way. Solid state storage is fast, reliable, and more. How often it was described as expensive and low capacity relative to 1TB spindle drives. Which storage are we using today—on our phones, tablets, pcs, and even in the cloud? Cost came down and capacities increased.
It is also just as likely that featured deemed missing in some comparison to the existing technology leader will prove to be less interesting as time goes by. Early laptops that lacked wired networking or RGB ports were viewed quite negatively. Today these just aren’t critical. It isn’t that networking or projection aren’t critical, but these have been recast in terms of implementation. Today we think of Wi-Fi or 4G along with technologies for wireless screen sharing, rather than wires for connectivity. The underlying scenario didn’t change, just a radical transformation of how it gets done.
This leads to the reality that systems will converge. While you might think “oh we’ll never need that again” there’s a good chance that even a newly recast, or reimagined, view of a system will quickly need to pick up features and capabilities previously developed.
One person’s treasure is another’s trash.
# # # # #