Web Performance Calendar

The speed geek's favorite time of year
2018 Edition
ABOUT THE AUTHOR

Alex Podelko (@apodelko) has specialized in performance since 1997, working as a performance engineer and architect for several companies. Currently he is Consulting Member of Technical Staff at Oracle, responsible for performance testing and optimization of Enterprise Performance Management and Business Intelligence (a.k.a. Hyperion) products.

Alex periodically talks and writes about performance-related topics, advocating tearing down silo walls between different groups of performance professionals. His collection of performance-related links and documents (including his recent articles and presentations) can be found at alexanderpodelko.com. Alex currently serves as a director for the Computer Measurement Group (CMG), an organization of performance and capacity planning professionals.

We have seen a lot of interesting developments in performance engineering recently – partly due to new technologies, partly due to a completely new scale of systems, and partly due to a new level of integration and sophistication. However, it is concerning that the need for a holistic approach is rarely mentioned.

What is often missed is context. And context is king here, so I want to shamelessly borrow the term “context-driven”, as it appears to be a good fit for what we need in performance engineering. The context-driven approach to testing was introduced in its classical form in 2001. It later became a loaded and politicized term in the functional testing community – but all of its original principles make perfect sense for performance testing, and some for performance engineering in general.

First, the importance of context: “The value of any practice depends on its context.” That is what is usually missed. Whenever a cool practice is shared (usually with very limited context – you rather have to infer it from minor details), people start to use it everywhere, while it may not make sense in another context, or may at least require significant adjustments.

Are we talking about a new system or a well-known existing system? What are the load / performance / resilience requirements and patterns? What technologies are used? What is the cost of failures and performance issues? How homogeneous is the load? What skills and resources are available? These and many other aspects define what combination of practices should be used and what a specific practice can bring to the table.

Second, best practices: “There are good practices in context, but there are no best practices.” I wouldn’t follow those who take this to the extreme, fighting any usage of the term “best practice” and interpreting it in its literal grammatical meaning as “best”. As a standard industry term, “best practices” long ago lost its literal meaning (just as “hot dog” doesn’t actually mean a hot dog) and, in my opinion, does mean “good practices in context” – but, unfortunately, that real meaning often gets lost.

Different Methods of Performance Risk Mitigation

I advocated a holistic approach to performance in my 2017 Performance Calendar post Shift Left, Shift Right – Is It Time for a Holistic Approach?. Here is a simplified list of performance methods from there:

  • Software Performance Engineering (SPE) or Performance Design. Everything that helps in selecting appropriate architecture and design and proving that it will scale according to our needs. Starting with performance requirements and including, for example, performance patterns and anti-patterns, scalable architectures, and modeling.
  • Single-User Performance Engineering. Everything that helps to ensure that single-user response times match expectations. Including profiling, tracking and optimization of single-user performance, and Web Performance Optimization (WPO). While these methods may prevent a lot of performance issues – they don’t cover multi-user and scalability aspects.
  • Monitoring / APM / Tracing / Log Analysis. Everything that provides insights into what is going on inside the working system and tracks down performance issues and trends. While Application Performance Management / Monitoring (APM) and Real-User Monitoring (RUM) are real game-changers here, providing tons of detailed information, they still mainly remain reactive approaches. They help to understand why issues happened – but after they did.
  • Load Testing. Everything used for testing systems under any multi-user load (including all other variations of multi-user testing, such as performance, concurrency, stress, endurance, longevity, scalability, reliability, and similar).
  • Capacity Planning / Management. Everything that ensures that we will have enough resources for the system, including both people-driven approaches and automatic self-management such as auto-scaling.
  • Reliability / Resilience Engineering. Including Site Reliability Engineering (SRE). Everything that ensures that the system will work even if a failure happens.
  • System Lifecycle. Continuous Integration / Delivery / Deployment allows quick deployment and removal of changes, thus decreasing the impact of performance issues. Agile / iterative development helps to find issues early, thus decreasing costs of fixing them.

All of the above exist not in a vacuum, but on top of high-priority functional requirements and resource constraints (including time, money, skills, etc.). Of course, this is just a cursory list to illustrate that we have many ways to mitigate performance risk – which may somewhat overlap and somewhat complement each other.

History

“Fix-it-later was a viable approach …, but today, the original premises no longer hold – and fix-it-later is archaic and dangerous. The original premises were:

  • Performance problems are rare.
  • Hardware is fast and inexpensive.
  • It’s too expensive to build responsive software.
  • You can tune software later, if necessary.”

Have you heard something like this recently? That is a quote from Dr. Connie Smith’s Performance Engineering of Software Systems, published in 1990. The book presented the foundations of software performance engineering, and already had 15 pages of bibliography on the subject.

It is surprising that, with all the changes happening around us, the underlying challenges of performance engineering remain the same. Check A Short History of Performance Engineering for more. The main lesson of that history is that the feeling that we are close to solving performance problems has existed for the last 50+ years, and it will probably stay with us for a while – so instead of hoping for a silver bullet, it is better to understand the different existing approaches to mitigating performance risks and find the combination of them that is optimal for your particular context.

It is interesting to look at how the handling of performance changed over time. Performance engineering probably went beyond single-user profiling when mainframes started to support multitasking, forming as a separate discipline in the 1960s. The workloads were mainly batch, with sophisticated ways to schedule and ration consumed resources, as well as pretty powerful OS-level instrumentation allowing performance issues to be tracked down. The cost of mainframe resources was high, so there were capacity planners and performance analysts to optimize mainframe usage.

Then the paradigm shifted to client-server and distributed systems. The available operating systems had almost no instrumentation or workload management capabilities, so load testing became almost the only remedy, in addition to system-level monitoring, for handling multi-user performance. Deploying across multiple machines was more difficult and the cost of rollback was significant, especially for Commercial Off-The-Shelf (COTS) software that might be deployed by thousands of customers. Load testing became probably the main way to ensure the performance of distributed systems, and performance testing groups became the centers of performance-related activities in many organizations.

Now we have another paradigm shift – to web / cloud. While the cloud looks quite different from mainframes, there are many similarities between them, especially from the performance point of view: availability of compute resources to be allocated, an easy way to evaluate the cost associated with these resources and implement chargeback, isolation of systems inside a larger pool of resources, and easier ways to deploy a system and pull it back if needed without impacting other systems.

However, there are notable differences, and they make managing performance in the cloud more challenging. First of all, there is no instrumentation at the OS level, and even resource monitoring becomes less reliable, so instrumentation should mostly be at the application level. Second, systems are not completely isolated from the performance point of view and can impact each other (even more so when we talk about containers). And, of course, we mostly have multi-user interactive workloads that are difficult to predict and manage. That means that performance risk mitigation approaches such as APM, performance testing, and capacity management are very important in the cloud.
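
As a rough illustration of what application-level instrumentation means in practice, here is a minimal sketch in Python that times an operation and keeps latency samples in the process. The timed decorator, the in-memory _timings store, and the get_report function are hypothetical stand-ins for what an APM agent or a metrics library would provide; this is a sketch of the idea, not a recommended implementation.

import functools
import time
from collections import defaultdict

# Hypothetical in-memory metrics store keyed by operation name.
_timings = defaultdict(list)

def timed(operation):
    """Record elapsed wall-clock time for each call to the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                _timings[operation].append(time.perf_counter() - start)
        return wrapper
    return decorator

@timed("get_report")
def get_report(report_id):
    time.sleep(0.05)  # stand-in for real application work
    return {"id": report_id}

if __name__ == "__main__":
    for i in range(10):
        get_report(i)
    samples = sorted(_timings["get_report"])
    print("get_report p50: %.3fs, max: %.3fs" % (samples[len(samples) // 2], samples[-1]))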

Performance Engineering Strategy

The art of performance engineering is to find the best strategy for combining different methods of mitigating performance risks – the one that optimizes the ratio of risk mitigation to cost for, of course, a specific context.

It appears that reactive methods have gained popularity recently, as a lot of cool projects were popularized by today’s iconic companies in areas such as A/B testing, tracing, resilience engineering, SRE, and many others. Many are indeed very impressive. However, they should be understood in their full context – which you often need to deduce from rather vague hints.

It is great if the suggested approach works in your context. But many of the most interesting developments happened in large-scale Internet projects, such as social networks – projects of enormous scale, but with limited risk if a specific record doesn’t get updated in time or something is not shown for a specific request. It may be quite the opposite for financial systems, where the scale may be smaller – but every single transaction is very important.

Let’s look at A/B testing as discussed in Load Testing at Netflix: Virtual Interview with Coburn Watson. As explained there, Netflix was very successful in using canary testing in some cases instead of load testing. In essence, canary testing is performance testing that uses real users to create load instead of having a load testing tool create synthetic load. It makes sense when 1) you have very homogeneous workloads and can control them precisely, 2) potential issues have minimal impact on user satisfaction and company image and you can easily roll back the changes, and 3) you have a fully parallel and scalable architecture. That was the case with Netflix – they simply traded the need to generate (and validate) workload for the possibility of minor issues and minor load variability. But the further you are from these conditions, the more questionable such a practice becomes.
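
To make the comparison step of canary testing concrete, here is a minimal sketch: a small share of real traffic goes to the new build, and its latency percentile is compared against the current build before the change is promoted. The function names, the 95/5 traffic split, and the 10% regression threshold are illustrative assumptions, not Netflix’s actual implementation.

import random

def p95(samples):
    """95th percentile of a list of latency samples (in seconds)."""
    ordered = sorted(samples)
    return ordered[int(0.95 * (len(ordered) - 1))]

def evaluate_canary(baseline_latencies, canary_latencies, max_regression=0.10):
    """Promote the canary only if its p95 latency is within 10% of the baseline."""
    base, canary = p95(baseline_latencies), p95(canary_latencies)
    return canary <= base * (1 + max_regression), base, canary

if __name__ == "__main__":
    # Simulated real-user latencies: 95% of traffic stays on the current build,
    # 5% is routed to the canary build.
    baseline = [random.gauss(0.200, 0.030) for _ in range(9500)]
    canary = [random.gauss(0.210, 0.030) for _ in range(500)]
    ok, base_p95, canary_p95 = evaluate_canary(baseline, canary)
    print("baseline p95=%.3fs canary p95=%.3fs -> %s"
          % (base_p95, canary_p95, "promote" if ok else "roll back"))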

The need for proactive methods is especially clear for new systems. If you are creating a new system, you had better use proactive methods such as performance testing and modeling to make sure that the system will perform as expected. The classic example of what may happen otherwise is the disastrous rollout of HealthCare.gov.

Moreover, when you hear about a sexy new technique, it is usually about the technique itself – even if some context is given. It usually doesn’t mean that it has replaced all other proven, mature techniques – rather, it usually complements them in new or particularly challenging areas (again, to what degree it complements and to what degree it replaces depends on context). And these proven, mature techniques are still used in most companies – but there is almost no publicity, and it is difficult to find out what combination of techniques is used in each case. That is the real challenge of performance engineering: you need to develop your own strategy, utilizing all appropriate methods in the combination optimal for your context, with rather limited information on how others do it.

Performance Testing

Let’s, for example, look at performance testing. While it remains rather in the shadows nowadays, it is a very important method of proactively mitigating performance risk, and it is reinventing itself as new development methodologies drastically widen the context and the available options.

Traditional load testing (optimized for the waterfall software development process) was focused, basically, on one context – pre-release, production-like – so the main goal was to make the load and the environment as similar to production as possible. Well, with some variations – such as stress, spike, uptime/endurance/longevity and other kinds of performance testing – still mainly based on realistic workloads.

I recall that when I shared a very good white paper, Rapid Bottleneck Identification – A Better Way to do Load Testing (here dated 2005), I was slammed by one renowned expert for encouraging the wrong way to do load testing. Well, maybe the name should be modified to include the context – A Better Way to do Load Testing for Bottleneck Identification – but indeed, complex realistic workloads are not optimal for many of the performance engineering tasks we have earlier in the development lifecycle (which are significantly more important nowadays, as we can indeed do performance testing early).

Drastic changes in the industry in recent years significantly expanded the performance testing horizon – with agile development and cloud computing probably contributing the most. I attempted to summarize the changes in Reinventing Performance Testing. Basically, instead of a single way of doing performance testing (with all others considered rather exotic), we have a full spectrum of different tests which can be done at different moments – so deciding what and when to test became a very non-trivial task, heavily dependent on the context.

For example, let’s consider the test environment. Options nowadays include traditional internal (and external) labs; the cloud as ‘Infrastructure as a Service’ (IaaS), when some parts of the system, or everything, are deployed there; and the cloud as ‘Software as a Service’ (SaaS), when vendors provide load testing as a service. There are advantages and disadvantages to each model, and depending on the specific goals and the systems to test, one deployment model may be preferred over another.
For example, to check the effect of a performance improvement (performance optimization), an isolated lab environment may be the better option, as it lets you see even small variations introduced by a change. To test the whole production environment end-to-end and make sure that the system will handle the load without any major issues, testing from the cloud or from a service may be more appropriate. To create a production-like test environment without going bankrupt, moving everything to the cloud for periodic performance testing may be a solution. For comprehensive performance testing, you probably need to use several approaches – for example, lab testing (for performance optimization, to get reproducible results) and distributed, realistic outside testing (to check real-life issues you can’t simulate in the lab). Limiting yourself to one approach limits the risks you will mitigate.

Agile development eliminates the main problem of traditional development: that you need to have a working system before you can test it, so performance testing happened at the last moment. While it was always recommended to start performance testing early, there were usually rather few activities you could do before the system was ready. Now, with agile development, we can indeed start testing early – thus opening up a plethora of new options (and challenges), such as continuous and exploratory performance testing.

There was an interesting discussion about what continuous performance testing solves and what it doesn’t after my Performance Advisory Council (PAC) presentation Continuous Performance Testing: Myths and Realities. It became obvious to me that performance testing in general, and specific performance testing techniques in particular, should be considered in full context – including environments, products, teams, issues, goals, budgets, timeframes, risks, etc. The question is not which technique is better – the question is which technique (or which combination of techniques) to use in a particular case (or, in more traditional wording, what the performance testing strategy should be).

The purpose of continuous performance testing is, basically, performance regression testing: checking that no unexpected performance degradations happened between tests (and verifying expected performance changes against the established baseline). It may start early (although that may be a bigger challenge at very early stages) – and it probably should continue whenever any changes happen to the system. It may be at the component level or at the system level (considering that not all functionality of the system is available in the beginning). Theoretically, it may even be full-scale, system-level realistic tests – but that doesn’t make sense in most contexts.

For continuous performance testing we rather want short, limited-scale, and fully reproducible tests (which means minimal randomness) – so that if the results differ, we know it is due to a system change. For full-scale, system-level tests, which check whether the system handles the expected load, we are more concerned with making sure that the workload and the system are as close to real life as possible – and less concerned with small variations in performance results. It doesn’t mean one is better than the other – they are different tests mitigating different performance risks. There is some overlap between them, as they both target performance risks – but continuous testing usually doesn’t test the system’s limits, and full-scale realistic tests are not good for tracking small differences between builds.
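
As a rough illustration, here is a minimal sketch of such a regression check as it might run in a CI pipeline: the current build’s results are compared against a stored baseline, and the build fails if any transaction slows down beyond an allowed tolerance. The transaction names, the p90 metric, and the 10% tolerance are illustrative assumptions, not any specific tool’s format.

import sys

TOLERANCE = 0.10  # fail on more than a 10% slowdown versus the baseline

def check_regressions(baseline, current, tolerance=TOLERANCE):
    """Return transactions whose p90 latency regressed beyond the tolerance."""
    regressions = []
    for name, base_latency in baseline.items():
        latency = current.get(name)
        if latency is not None and latency > base_latency * (1 + tolerance):
            regressions.append((name, base_latency, latency))
    return regressions

if __name__ == "__main__":
    # In a real pipeline these would be loaded from the committed baseline run
    # and from the current build's short, reproducible regression test.
    baseline = {"login": 0.120, "search": 0.450, "report": 1.800}
    current = {"login": 0.125, "search": 0.610, "report": 1.750}
    failed = check_regressions(baseline, current)
    for name, base, now in failed:
        print("REGRESSION %s: p90 %.3fs -> %.3fs" % (name, base, now))
    sys.exit(1 if failed else 0)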