Web Performance Calendar

The speed geek's favorite time of year
2012 Edition
ABOUT THE AUTHOR

For the last fourteen years, Alex Podelko (@apodelko) has worked as a performance engineer and architect for several companies. Currently he is a Consulting Member of Technical Staff at Oracle, responsible for performance testing and optimization of Hyperion products. Alex currently serves as a director of the Computer Measurement Group (CMG). He maintains a collection of performance-related links and documents.

Performance Puzzle

There are many discussions about performance, but they often concentrate on only one specific facet of it. The problem is that performance is the result of every design and implementation detail, so you can't ensure performance by approaching it from a single angle only.

There are different approaches and techniques to alleviate performance risks, such as:

  • Single-User Performance Engineering. Everything that helps to ensure that single-user response times, the critical performance path, match our expectations, including profiling, tracking and optimizing single-user performance, and Web Performance Optimization (WPO).
  • Software Performance Engineering (SPE). Everything that helps in selecting an appropriate architecture and design and proving that it will scale to our needs, including performance patterns and anti-patterns, scalable architectures, and modeling.
  • Instrumentation / Application Performance Management (APM) / Monitoring. Everything that provides insight into what is going on inside the working system and tracks down performance issues and trends.
  • Capacity Planning / Management. Everything that ensures we will have enough resources for the system, including both people-driven approaches and automatic self-management such as auto-scaling.
  • Load Testing. Everything used for testing the system under any multi-user load, including all other variations of multi-user testing, such as performance, concurrency, stress, endurance, longevity, scalability, and reliability testing (a minimal sketch of such a test follows this list).
  • Continuous Integration / Delivery / Deployment. Everything that allows quick deployment and rollback of changes, decreasing the impact of performance issues.
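
To make the load testing bullet concrete, here is a minimal sketch of a multi-user load test in Python. It is an illustration of the idea rather than a real tool: the URL, the number of virtual users, and the request count are placeholder assumptions, and a production test would also need think times, error handling, and a realistic workload model.

    import concurrent.futures
    import statistics
    import time
    import urllib.request

    URL = "http://localhost:8080/"   # placeholder: the system under test
    USERS = 25                       # concurrent virtual users
    REQUESTS_PER_USER = 40           # sequential requests per user

    def one_user(n):
        """Issue n sequential requests and return each response time in seconds."""
        timings = []
        for _ in range(n):
            start = time.perf_counter()
            with urllib.request.urlopen(URL, timeout=10) as resp:
                resp.read()
            timings.append(time.perf_counter() - start)
        return timings

    # Run the virtual users in parallel and merge their timings.
    with concurrent.futures.ThreadPoolExecutor(max_workers=USERS) as pool:
        timings = sorted(t for user in pool.map(one_user, [REQUESTS_PER_USER] * USERS)
                         for t in user)

    print(f"requests: {len(timings)}")
    print(f"median:   {statistics.median(timings):.3f}s")
    print(f"95th pct: {timings[int(len(timings) * 0.95)]:.3f}s")

Reporting percentiles rather than averages matters here: under load, the tail of the response time distribution usually degrades first.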

And, of course, all of the above does not exist in a vacuum, but on top of high-priority functional requirements and resource constraints (including time, money, skills, etc.).

Every approach or technique mentioned above mitigates performance risks to some degree and improves the chances that the system will perform up to expectations. However, none of them guarantees that, and none completely replaces the others, as each addresses different facets of performance.

To illustrate that, let's look, for example, at load testing. Recent trends of agile development, DevOps, lean startup, and web operations somewhat question the importance of load testing. Some (not many) openly say that they don't need load testing, while others still pay lip service to it but just never get there. In the more traditional corporate world, we still see performance testing groups, and important systems still get load tested before deployment.

A Closer Look at Load Testing

Yes, the other ways to mitigate performance risks mentioned above can definitely decrease performance risk compared to a situation where nothing is done about performance at all until the last moment before rolling the system out to production without any instrumentation. But they still leave the risk of crashes and performance degradation under multi-user load. There is always a risk of crashing the system or hitting performance issues under heavy load, and the only way to mitigate it is to actually test the system. Even stellar performance in production and a highly scalable architecture don't guarantee that the system won't crash under a slightly higher load. Load testing doesn't completely guarantee it either (for example, the real-life workload may differ from what you tested), but it significantly decreases the risk.

Another important value of load testing is checking how changes impact multi-user performance. The impact on multi-user performance is usually not proportional to what you see with single-user performance and is often counterintuitive; sometimes a single-user performance improvement leads to multi-user performance degradation. And the more complex the system, the more likely exotic multi-user performance issues become. Load testing provides a reliable and reproducible way to apply the multi-user load needed for performance optimization and performance troubleshooting.
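
As a sketch of how this looks in practice, the fragment below (Python again, with a hypothetical URL and illustrative user counts) measures the median response time first with one virtual user and then with fifty. On a perfectly scalable system the two medians would be close; a large ratio reveals contention that single-user profiling alone would never show.

    import concurrent.futures
    import statistics
    import time
    import urllib.request

    URL = "http://localhost:8080/"  # hypothetical endpoint under test

    def measure(concurrency, requests_per_user=20):
        """Median response time (seconds) at a given number of concurrent users."""
        def one_user():
            timings = []
            for _ in range(requests_per_user):
                start = time.perf_counter()
                with urllib.request.urlopen(URL, timeout=10) as resp:
                    resp.read()
                timings.append(time.perf_counter() - start)
            return timings
        with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
            futures = [pool.submit(one_user) for _ in range(concurrency)]
            timings = [t for f in futures for t in f.result()]
        return statistics.median(timings)

    baseline = measure(1)
    loaded = measure(50)
    print(f"1 user median:  {baseline:.3f}s")
    print(f"50 user median: {loaded:.3f}s ({loaded / baseline:.1f}x)")

The same harness, run before and after a change, also shows whether a single-user optimization helped or hurt under concurrency.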

It may be possible to survive without load testing, relying on the other ways to mitigate performance risks, if the cost of performance issues and downtime is low. However, that actually means you are using your customers and/or users to test the system, addressing only those issues that pop up; this approach becomes risky once performance and downtime start to matter.

Moreover, with current trends toward system self-regulation (such as auto-scaling or changing the level of service depending on load), load testing is needed to verify that very functionality: you need to apply heavy load to see how auto-scaling will actually work. Load testing thus becomes a way to test the functionality of the system, blurring the traditional division between functional and nonfunctional testing.
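
As an illustration, a stepped load profile like the sketch below is one way to exercise auto-scaling (Python again; the URL, step sizes, and hold time are assumptions, not a recipe). Each step adds virtual users and then holds the load long enough for scale-out to trigger, while the number of provisioned instances is watched out of band.

    import threading
    import time
    import urllib.request

    URL = "http://localhost:8080/"   # placeholder: endpoint behind an auto-scaling group
    STEPS = [10, 50, 100, 200]       # target virtual users at each step
    HOLD_SECONDS = 600               # hold each step long enough to trigger scale-out

    stop = threading.Event()

    def user():
        """Loop requests until the test is stopped."""
        while not stop.is_set():
            try:
                with urllib.request.urlopen(URL, timeout=10) as resp:
                    resp.read()
            except OSError:
                pass  # a real test would record errors per step

    threads = []
    for target in STEPS:
        while len(threads) < target:             # ramp up to this step's target
            t = threading.Thread(target=user, daemon=True)
            t.start()
            threads.append(t)
        print(f"holding {target} users for {HOLD_SECONDS}s")
        time.sleep(HOLD_SECONDS)                 # watch instance counts meanwhile

    stop.set()

Whether new instances actually appear, and whether response times stay acceptable while they spin up, is exactly the kind of functional check described above.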

Historical View

It is interesting to look at how the handling of performance changed over time. Performance probably went beyond single-user profiling when mainframes started to support multiprogramming. Workloads were mainly batch, with sophisticated ways to schedule and ration consumed resources, as well as pretty powerful OS-level instrumentation that allowed tracking down performance issues. The cost of mainframe resources was high, so there were capacity planners and performance analysts to optimize mainframe usage.

Then the paradigm changed to client-server and distributed systems. The available operating systems had almost no instrumentation or workload-management capabilities, so load testing became almost the only remedy, in addition to system-level monitoring, for handling multi-user performance. Deploying across multiple machines was more difficult, and the cost of a rollback was significant, especially for Commercial Off-The-Shelf (COTS) software that might be deployed by thousands of customers. Load testing became probably the main way to ensure the performance of distributed systems, and performance testing groups became the centers of performance-related activities in many organizations.

While the cloud looks quite different from mainframes, there are many similarities between them, especially from the performance point of view: the availability of computing resources to be allocated, an easy way to evaluate the cost associated with these resources and implement chargeback, isolation of systems inside a larger pool of resources, and easier ways to deploy a system and pull it back if needed without impacting other systems.

However, there are notable differences, and they make managing performance in the cloud more challenging. First of all, there is no instrumentation at the OS level, and even resource monitoring becomes less reliable, so all instrumentation has to live at the application level. Second, systems are not completely isolated from the performance point of view and can impact each other. And, of course, we mostly have multi-user interactive workloads, which are difficult to predict and manage. That means performance risk mitigation approaches such as APM, load testing, and capacity management are very important in the cloud.

It is interesting that, while performance is the result of all design and implementation details, the performance engineering field remains very siloed. Those who do capacity planning are usually not much involved in performance testing or software performance engineering. The newest and fastest-growing group, web performance specialists, remains mostly isolated from the other performance-related groups. People and organizations trying to tie all performance-related activities together are few and far between.

I don't see the need for specific performance-related expertise, such as load testing or capacity planning, going away. Even in the case of web operations, we will probably see load testing coming back as soon as systems become more complex and performance issues start to hurt the business. There will perhaps be less need for "performance testers" than at the heyday, thanks to better instrumentation, APM tools, continuous integration, resource availability, etc., but I'd expect more need for performance experts who can see the whole picture using all available tools and techniques.